WorldWideScience

Sample records for multiple processor architectures

  1. Parallel processor simulator for multiple optic channel architectures

    Science.gov (United States)

    Wailes, Tom S.; Meyer, David G.

    1992-12-01

    A parallel processing architecture based on multiple channel optical communication is described and compared with existing interconnection strategies for parallel computers. The proposed multiple channel architecture (MCA) uses MQW-DBR lasers to provide a large number of independent, selectable channels (or virtual buses) for data transport. Arbitrary interconnection patterns as well as machine partitions can be emulated via appropriate channel assignments. Hierarchies of parallel architectures and simultaneous execution of parallel tasks are also possible. Described are a basic overview of the proposed architecture, various channel allocation strategies that can be utilized by the MCA, and a summary of advantages of the MCA compared with traditional interconnection techniques. Also described is a comprehensive multiple processor simulator that has been developed to execute parallel algorithms using the MCA as a data transport mechanism between processors and memory units. Simulation results -- including average channel load, effective channel utilization, and average network latency for different algorithms and different transmission speeds -- are also presented.

  2. New Generation Processor Architecture Research

    Institute of Scientific and Technical Information of China (English)

    Chen Hongsong(陈红松); Hu Mingzeng; Ji Zhenzhou

    2003-01-01

    With the rapid development of microelectronics and hardware, the use of ever faster microprocessors and new architectures must continue in order to meet tomorrow's computing needs. New processor microarchitectures are needed to push performance further and to use higher transistor counts effectively. At the same time, aiming at different usages, the processor has been optimized in different aspects, such as high performance, low power consumption, small chip area and high security. SOC (System on Chip) and SCMP (Single Chip Multi Processor) constitute the main processor system architectures.

  3. HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

    Science.gov (United States)

    van Dyk, Danny; Geveler, Markus; Mallach, Sven; Ribbrock, Dirk; Göddeke, Dominik; Gutwenger, Carsten

    2009-12-01

    We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straightforward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.

    Program summary
    Program title: HONEI
    Catalogue identifier: AEDW_v1_0
    Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html
    Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
    Licensing provisions: GPLv2
    No. of lines in distributed program, including test data, etc.: 216 180
    No. of bytes in distributed program, including test data, etc.: 1 270 140
    Distribution format: tar.gz
    Programming language: C++
    Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3
    Operating system: Linux
    RAM: at least 500 MB free
    Classification: 4.8, 4.3, 6.1
    External routines: SSE: none; [1] for GPU, [2] for Cell backend
    Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the

  4. SPROC: A multiple-processor DSP IC

    Science.gov (United States)

    Davis, R.

    1991-01-01

    A large, single-chip, multiple-processor, digital signal processing (DSP) integrated circuit (IC) fabricated in HP-Cmos34 is presented. The innovative architecture is best suited for analog and real-time systems characterized by both parallel signal data flows and concurrent logic processing. The IC is supported by a powerful development system that transforms graphical signal flow graphs into production-ready systems in minutes. Automatic compiler partitioning of tasks among four on-chip processors gives the IC the signal processing power of several conventional DSP chips.

  5. Critical review of programmable media processor architectures

    Science.gov (United States)

    Berg, Stefan G.; Sun, Weiyun; Kim, Donglok; Kim, Yongmin

    1998-12-01

    In the past several years, there has been a surge of new programmable mediaprocessors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.

  6. Floating-point multiple data stream digital signal processor

    Energy Technology Data Exchange (ETDEWEB)

    Fortier, M.; Corinthios, M.J.

    1982-01-01

    A microprogrammed multiple data stream digital signal processor is introduced. This floating-point processor is capable of implementing optimum Wiener filtering of signals, in general, and images in particular. Generalised spectral analysis transforms such as Fourier, Walsh, Hadamard, and generalised Walsh are efficiently implemented in a bit-slice microprocessor-based architecture. In this architecture, a microprogrammed sequencing section directly controls a central floating-point signal processing unit. Throughout, computations are performed on pipelined multiple complex data streams. 12 references.

  7. Optical linear algebra processors - Architectures and algorithms

    Science.gov (United States)

    Casasent, David

    1986-01-01

    Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.

  9. Intrusion Detection Architecture Utilizing Graphics Processors

    Directory of Open Access Journals (Sweden)

    Branislav Madoš

    2012-12-01

    With thriving technology and the great increase in the usage of computer networks, the risk of these networks coming under attack has increased. A number of techniques have been created and designed to help in detecting and/or preventing such attacks. One common technique is the use of Intrusion Detection Systems (IDS). Today, a number of open-source and commercial IDS are available to match enterprises' requirements. However, the performance of these systems is still the main concern. This paper examines perceptions of intrusion detection architecture implementation, resulting from the use of graphics processors. It discusses recent research activities, developments and problems of operating systems security. Some exploratory evidence is presented that shows capabilities of using graphical processors and intrusion detection systems. The focus is on how knowledge experienced throughout the graphics processor inclusion has played out in the design of intrusion detection architecture that is seen as an opportunity to strengthen research expertise.

  10. Soft-core dataflow processor architecture optimised for radar signal processing: Article

    CSIR Research Space (South Africa)

    Broich, R

    2014-10-01

    ...an iterative design methodology to propose a novel softcore streaming processor architecture. The datapaths of this architecture are arranged in a circular pattern, with multiple operands simultaneously flowing between switching multiplexers and functional...

  11. Advanced Multiple Processor Configuration Study. Final Report.

    Science.gov (United States)

    Clymer, S. J.

    This summary of a study on multiple processor configurations includes the objectives, background, approach, and results of research undertaken to provide the Air Force with a generalized model of computer processor combinations for use in the evaluation of proposed flight training simulator computational designs. An analysis of a real-time flight…

  12. Temporal Partitioning and Multi-Processor Scheduling for Reconfigurable Architectures

    DEFF Research Database (Denmark)

    Popp, Andreas; Le Moullec, Yannick; Koch, Peter

    This poster presentation outlines a proposed framework for handling mapping of signal processing applications to heterogeneous reconfigurable architectures. The methodology consists of an extension to traditional multi-processor scheduling by creating a separate HW track for generation of groups of tasks that are handled similarly to SW processes in a traditional multi-processor scheduling context.

  13. MICROTHREAD BASED (MTB) COARSE GRAINED FAULT TOLERANCE SUPERSCALAR PROCESSOR ARCHITECTURE

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    Fault tolerance in microprocessor systems has become a popular topic of architecture research. Much work has been done at different levels to accomplish reliability against soft errors, and some fault tolerance architectures have been proposed. But little attention has been paid to thread-level superscalar fault tolerance. This letter introduces the microthread concept into the superscalar processor fault tolerance domain, and puts forward a novel fault tolerance architecture, namely, the MicroThread Based (MTB) coarse grained transient fault tolerance superscalar processor architecture, then discusses some detailed implementations.

  14. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    Science.gov (United States)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
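
    The LU decomposition named in this abstract can be illustrated with a plain sequential sketch. This is not the optical systolic realization the record describes; it is only the textbook Doolittle elimination, without row exchanges, and the function name `lu_decompose` is a hypothetical choice made for this example.

```python
def lu_decompose(a):
    """Doolittle LU factorization without pivoting: returns (L, U) with a = L * U.

    Illustrative only: a zero pivot raises an error because this sketch
    performs no row exchanges.
    """
    n = len(a)
    # L starts as the identity matrix, U as a floating-point copy of the input.
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]
    for k in range(n):
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no row exchanges")
        for i in range(k + 1, n):
            # Multiplier that eliminates entry (i, k); it is stored in L.
            factor = upper[i][k] / upper[k][k]
            lower[i][k] = factor
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


def matmul(x, y):
    """Dense matrix product, used here only to check that L * U reproduces a."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]
```

    The inner update over `j` is independent for each row `i`, which is the kind of regular data flow that systolic arrays pipeline.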

  17. FPGA Based Intelligent Co-operative Processor in Memory Architecture

    DEFF Research Database (Denmark)

    Ahmed, Zaki; Sotudeh, Reza; Hussain, Dil Muhammad Akbar

    2011-01-01

    In a continuing effort to improve computer system performance, Processor-In-Memory (PIM) architecture has emerged as an alternative solution. PIM architecture incorporates computational units and control logic directly on the memory to provide immediate access to the data. To exploit the potential...

  18. Design and Implementation of Quintuple Processor Architecture Using FPGA

    Directory of Open Access Journals (Sweden)

    P.Annapurna

    2014-09-01

    The advanced quintuple processor core is a design philosophy that has become mainstream in scientific and engineering applications. The increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. Embedded multiprocessors face a new problem with thread synchronization. It is caused by the distributed memory: when thread synchronization is violated, the processors can access the same value at the same time. Processor performance can be increased by adopting clock scaling techniques and microarchitectural enhancements. Therefore, a new architecture called Advanced Concurrent Computing was designed. It is implemented on the FPGA chip using VHDL. The Advanced Concurrent Computing architecture makes simultaneous use of both parallel and distributed computing. The full architecture of the quintuple processor core is designed to perform arithmetic, logical, shifting and bit manipulation operations. The proposed advanced quintuple processor core contains homogeneous RISC processors, together with pipelined processing units, a multi-bus organization and I/O ports, along with the other functional elements required to implement embedded SOC solutions. Performance issues of the designed quintuple processor, such as area, speed, power dissipation and propagation delay, are analyzed at 90nm process technology using the Xilinx tool.

  19. An efficient hardware architecture of a scalable elliptic curve crypto-processor over GF(2^n)

    Science.gov (United States)

    Tawalbeh, Lo'ai; Tenca, Alexandre; Park, Song; Koc, Cetin

    2005-08-01

    This paper presents a scalable Elliptic Curve Crypto-Processor (ECCP) architecture for computing the point multiplication for curves defined over the binary extension fields (GF(2^n)). This processor computes modular inverse and Montgomery modular multiplication using a new efficient algorithm. The scalability feature of the proposed crypto-processor allows a fixed-area datapath to handle operands of any size. Also, the word size of the datapath can be adjusted to meet the area and performance requirements. On the other hand, the processor is reconfigurable in the sense that the user has the ability to choose the value of the field parameter (n). Experimental results show that the proposed crypto-processor is competitive with many other previous designs.

  20. FPGA wavelet processor design using language for instruction-set architectures (LISA)

    Science.gov (United States)

    Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios

    2007-04-01

    The design of a microprocessor is a long, tedious, and error-prone task consisting of typically three design phases: architecture exploration, software design (assembler, linker, loader, profiler), architecture implementation (RTL generation for FPGA or cell-based ASIC) and verification. The Language for instruction-set architectures (LISA) allows a microprocessor to be modeled not only from its instruction set but also from an architecture description including pipelining behavior, which gives design and development tool consistency over all levels of the design. To explore the capability of the LISA processor design platform, a.k.a. CoWare Processor Designer, we present in this paper three microprocessor designs that implement an 8/8 wavelet transform processor that is typically used in today's FBI fingerprint compression scheme. We have designed a 3 stage pipelined 16 bit RISC processor (NanoBlaze). Although RISC μPs are usually considered "fast" processors due to design concepts like constant instruction word size, deep pipelines and many general purpose registers, it turns out that DSP operations consume essential processing time in a RISC processor. In a second step we have used design principles from programmable digital signal processors (PDSP) to improve the throughput of the DWT processor. A multiply-accumulate operation along with an indirect addressing operation were the key to achieving higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the essential improvement that a TVP has compared with traditional RISC or PDSP designs.

  1. A novel VLSI processor architecture for supercomputing arrays

    Science.gov (United States)

    Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.

    1993-01-01

    Design of the processor element for general purpose massively parallel supercomputing arrays is highly complex and cost ineffective. To overcome this, the architecture and organization of the functional units of the processor element should be such as to suit the diverse computational structures and simplify mapping of complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different classes of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP,IP,OP,CM,R,SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units are such as to satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process has led to a very regular and simpler design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high performance mask programmable PAcube arrays.

  2. Message-Driven Processor Architecture Version 11

    Science.gov (United States)

    1988-08-18

    ...fields instead of 2. This reflects the change in machine topology from 2D to 3D. Also, the NNR is no longer set to zero on a reset; it is left to...an X field, a Y field and a Z field indicating the position of the node in the 3D network grid. Its value identifies the processor on the network and

  3. Architecture and Design of Medical Processor Units for Medical Networks

    CERN Document Server

    Ahamed, Syed V; 10.5121/ijcnc.2010.2602

    2011-01-01

    This paper introduces analogical and deductive methodologies for the design of medical processor units (MPUs). From the study of the evolution of numerous earlier processors, we derive the basis for the architecture of MPUs. These specialized processors perform unique medical functions encoded as medical operational codes (mopcs). From a pragmatic perspective, MPUs function very close to CPUs. Both processors have unique operation codes that command the hardware to perform a distinct chain of subprocesses upon operands and generate a specific result unique to the opcode and the operand(s). In medical environments, the MPU decodes the mopcs, executes a series of medical sub-processes and sends out secondary commands to the medical machine. Whereas operands in a typical computer system are numerical and logical entities, the operands in a medical machine are objects such as patients, blood samples, tissues, operating rooms, medical staff, medical bills, patient payments, etc. We follow the functional overlap betw...

  4. Reversible machine code and its abstract processor architecture

    DEFF Research Database (Denmark)

    Axelsen, Holger Bock; Glück, Robert; Yokoyama, Tetsuo

    2007-01-01

    A reversible abstract machine architecture and its reversible machine code are presented and formalized. For machine code to be reversible, both the underlying control logic and each instruction must be reversible. A general class of machine instruction sets was proven to be reversible, building on our concept of reversible updates. The presentation is abstract and can serve as a guideline for a family of reversible processor designs. By example, we illustrate programming principles for the abstract machine architecture formalized in this paper.

  5. Design of Variable Width Barrel Shifter for High Speed Processor Architecture

    Directory of Open Access Journals (Sweden)

    Rajeev Kumar

    2012-04-01

    The microprocessor is the brain of the computer. It works as the Central Processing Unit of the computer. It contains an Arithmetic Logical Unit (ALU) that performs arithmetic operations such as Addition, Subtraction, Multiplication and Division. It also performs logical operations such as AND, NAND, OR, NOR, EXOR, EXNOR and NOT. It also contains a register file to store the operands of load/store instructions in a RISC Processor Architecture. The Control Unit generates the control signals that synchronize the operation of the processor and tell the microarchitecture which operation is done at which time. During multiplication, the partial product is shifted and added, so the shifter is an important part of the processor architecture. The Barrel Shifter is an important combinational logic block. It was incorporated in the 386 processor and is also used in microcontroller design. Intel has since moved to software implemented shifters in the Pentium 4 Processor Architecture but AMD still uses it. Here the design of a variable width barrel shifter is presented, which can shift a 4-bit, 8-bit, 16-bit, 32-bit or at most 64-bit partial product during multiplication. Functionality is checked using Modelsim 6.4a. Xilinx ISE 9.2i is used to generate the gate level netlist.
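
    A barrel shifter is commonly built as a cascade of multiplexer stages, where stage i either passes its input through or shifts it by 2^i positions depending on bit i of the shift amount. The sketch below models that behaviour in software; it is an assumption about the usual logarithmic structure rather than the design from this record, and the name `barrel_shift_left` and its parameters are hypothetical.

```python
def barrel_shift_left(value: int, amount: int, width: int = 64) -> int:
    """Model a logarithmic barrel shifter for a power-of-two word width.

    Stage i shifts by 2**i when bit i of `amount` is set, so a 64-bit word
    needs 6 stages. Bits of `amount` at or above log2(width) are ignored,
    which means the shift amount is effectively taken modulo `width`.
    """
    mask = (1 << width) - 1
    value &= mask
    stage = 0
    while (1 << stage) < width:
        if (amount >> stage) & 1:
            # This stage is enabled: shift by 2**stage and drop overflow bits.
            value = (value << (1 << stage)) & mask
        stage += 1
    return value
```

    For example, `barrel_shift_left(0b1011, 3, width=8)` enables stages 0 and 1 (shifts of 1 and 2) and yields `0b01011000`. The same function covers the 4, 8, 16, 32 and 64 bit widths mentioned in the abstract by changing `width`.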

  6. Design and Realization of Array Signal Processor VLSI Architecture for Phased Array System

    Directory of Open Access Journals (Sweden)

    D. Govind Rao

    2016-08-01

    A method for implementing an array signal processor for phased array radars is presented. The array signal processor can receive planar array antenna inputs and process them. It is based on the application of adaptive digital beamformers using FPGAs. The adaptive filter algorithm used here is the Inverse Q-R Decomposition based Recursive Least Squares (IQRD-RLS) [1] algorithm. An array signal processor based on FPGAs is suitable for phased array radar receivers, where speed, accuracy and numerical stability are of utmost importance. Using the IQRD-RLS algorithm, optimal weights are calculated in much less time compared to the conventional QRD-RLS algorithm. A customized multiple FPGA board comprising three Kintex-7 FPGAs is employed to implement the array signal processor. The proposed architecture can form multiple beams from planar array antenna elements.

  7. High-performance hardware architecture of elliptic curve cryptography processor over GF(2^163)

    Institute of Scientific and Technical Information of China (English)

    Yong-ping DAN; Xue-cheng ZOU; Zheng-lin LIU; Yu HAN; Li-hua YI

    2009-01-01

    We propose a novel high-performance hardware architecture of processor for elliptic curve scalar multiplication based on the Lopez-Dahab algorithm over GF(2^163) in polynomial basis representation. The processor can do all the operations using an efficient modular arithmetic logic unit, which includes an addition unit, a square and a carefully designed multiplication unit. In the proposed architecture, multiplication, addition, and square can be performed in parallel by the decomposition of computation. The point addition and point doubling iteration operations can be performed in six multiplications by optimization and solution of data dependency. The implementation results based on Xilinx Virtex-II XC2V6000 FPGA show that the proposed design can do random elliptic curve scalar multiplication GF(2^163) in 34.11 μs, occupying 2821 registers and 13 376 LUTs.
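
    Field multiplication in polynomial basis, which this record's multiplication unit performs in hardware, can be sketched in software as shift-and-XOR with reduction. The code below is a bit-serial illustration, not the paper's multiplier. The abstract does not state the reduction polynomial, so the NIST pentanomial x^163 + x^7 + x^6 + x^3 + 1 is assumed here, and the names `gf2m_mul` and `NIST_163_POLY` are hypothetical.

```python
# Assumed reduction polynomial for GF(2^163): x^163 + x^7 + x^6 + x^3 + 1.
NIST_163_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf2m_mul(a: int, b: int, m: int = 163, poly: int = NIST_163_POLY) -> int:
    """Multiply two elements of GF(2^m) held as integers (bit i = coefficient of x^i).

    `poly` must include the x^m term. Addition in GF(2^m) is XOR, so the
    product is accumulated with XOR and reduced whenever the degree reaches m.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= poly            # reduce modulo the field polynomial
    return result
```

    The same routine works for any binary field: with `m=8` and `poly=0x11B` it reproduces the worked example from the AES specification, where `0x57` times `0x83` equals `0xC1`. A hardware multiplier processes many bits per cycle instead of one, but computes the same function.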

  8. Multiple core computer processor with globally-accessible local memories

    Energy Technology Data Exchange (ETDEWEB)

    Shalf, John; Donofrio, David; Oliker, Leonid

    2016-09-20

    A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality of processor cores.

  9. Scalable architecture for a room temperature solid-state quantum information processor.

    Science.gov (United States)

    Yao, N Y; Jiang, L; Gorshkov, A V; Maurer, P C; Giedke, G; Cirac, J I; Lukin, M D

    2012-04-24

    The realization of a scalable quantum information processor has emerged over the past decade as one of the central challenges at the interface of fundamental science and engineering. Here we propose and analyse an architecture for a scalable, solid-state quantum information processor capable of operating at room temperature. Our approach is based on recent experimental advances involving nitrogen-vacancy colour centres in diamond. In particular, we demonstrate that the multiple challenges associated with operation at ambient temperature, individual addressing at the nanoscale, strong qubit coupling, robustness against disorder and low decoherence rates can be simultaneously achieved under realistic, experimentally relevant conditions. The architecture uses a novel approach to quantum information transfer and includes a hierarchy of control at successive length scales. Moreover, it alleviates the stringent constraints currently limiting the realization of scalable quantum processors and will provide fundamental insights into the physics of non-equilibrium many-body quantum systems.

  10. Behavioral Simulation and Performance Evaluation of Multi-Processor Architectures

    Directory of Open Access Journals (Sweden)

    Ausif Mahmood

    1996-01-01

    The development of multi-processor architectures requires extensive behavioral simulations to verify the correctness of design and to evaluate its performance. A high level language can provide maximum flexibility in this respect if the constructs for handling concurrent processes and a time mapping mechanism are added. This paper describes a novel technique for emulating hardware processes involved in a parallel architecture such that an object-oriented description of the design is maintained. The communication and synchronization between hardware processes are handled by splitting the processes into their equivalent subprograms at the entry points. The proper scheduling of these subprograms is coordinated by a timing wheel which provides a time mapping mechanism. Finally, a high level language pre-processor is proposed so that the timing wheel and the process emulation details can be made transparent to the user.
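
    The idea of splitting processes at entry points and scheduling the pieces from a timing wheel can be sketched with generators, where each `yield` marks an entry point and the yielded value is the delay until the process resumes. This is a loose illustration under those assumptions, not the paper's technique; `TimingWheel`, `producer`, `consumer` and `demo` are hypothetical names.

```python
from collections import deque


class TimingWheel:
    """Circular array of slots; slot (now + delay) % size holds processes due then."""

    def __init__(self, slots: int = 16):
        self.slots = [deque() for _ in range(slots)]
        self.now = 0
        self.pending = 0

    def schedule(self, delay: int, process) -> None:
        if not 0 <= delay < len(self.slots):
            raise ValueError("delay must fit within one wheel revolution")
        self.slots[(self.now + delay) % len(self.slots)].append(process)
        self.pending += 1

    def run(self) -> None:
        while self.pending:
            slot = self.slots[self.now % len(self.slots)]
            while slot:
                process = slot.popleft()
                self.pending -= 1
                try:
                    delay = next(process)   # run up to the next entry point
                except StopIteration:
                    continue                # process has finished
                self.schedule(delay, process)
            self.now += 1


def producer(wheel, log):
    for i in range(3):
        log.append((wheel.now, "producer", i))
        yield 2                             # resume two time units later


def consumer(wheel, log):
    for i in range(2):
        log.append((wheel.now, "consumer", i))
        yield 3                             # resume three time units later


def demo():
    wheel, log = TimingWheel(), []
    wheel.schedule(0, producer(wheel, log))
    wheel.schedule(1, consumer(wheel, log))
    wheel.run()
    return log
```

    Calling `demo()` returns the events in simulated-time order, with both processes interleaved by the wheel rather than by the order in which they were written.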

  11. Advanced Avionics and Processor Systems for a Flexible Space Exploration Architecture

    Science.gov (United States)

    Keys, Andrew S.; Adams, James H.; Smith, Leigh M.; Johnson, Michael A.; Cressler, John D.

    2010-01-01

    The Advanced Avionics and Processor Systems (AAPS) project, formerly known as the Radiation Hardened Electronics for Space Environments (RHESE) project, endeavors to develop advanced avionic and processor technologies anticipated to be used by NASA's currently evolving space exploration architectures. The AAPS project is a part of the Exploration Technology Development Program, which funds an entire suite of technologies that are aimed at enabling NASA's ability to explore beyond low earth orbit. NASA's Marshall Space Flight Center (MSFC) manages the AAPS project. AAPS uses a broad-scoped approach to developing avionic and processor systems. Investment areas include advanced electronic designs and technologies capable of providing environmental hardness, reconfigurable computing techniques, software tools for radiation effects assessment, and radiation environment modeling tools. Near-term emphasis within the multiple AAPS tasks focuses on developing prototype components using semiconductor processes and materials (such as Silicon-Germanium (SiGe)) to enhance a device's tolerance to radiation events and low temperature environments. As the SiGe technology will culminate in a delivered prototype this fiscal year, the project emphasis shifts its focus to developing low-power, high efficiency total processor hardening techniques. In addition to processor development, the project endeavors to demonstrate techniques applicable to reconfigurable computing and partially reconfigurable Field Programmable Gate Arrays (FPGAs). This capability enables avionic architectures the ability to develop FPGA-based, radiation tolerant processor boards that can serve in multiple physical locations throughout the spacecraft and perform multiple functions during the course of the mission. The individual tasks that comprise AAPS are diverse, yet united in the common endeavor to develop electronics capable of operating within the harsh environment of space. Specifically, the AAPS tasks for

  12. Cooperative Computing Techniques for a Deeply Fused and Heterogeneous Many-Core Processor Architecture

    Institute of Scientific and Technical Information of China (English)

    郑方; 李宏亮; 吕晖; 过锋; 许晓红; 谢向辉

    2015-01-01

    Due to advances in semiconductor techniques, many-core processors have been widely used in high performance computing. However, many applications still cannot be carried out efficiently due to the memory wall, which has become a bottleneck in many-core processors. In this paper, we present a novel heterogeneous many-core processor architecture named deeply fused many-core (DFMC) for high performance computing systems. DFMC integrates management processing elements (MPEs) and computing processing elements (CPEs), which are heterogeneous processor cores for different application features with a unified ISA (instruction set architecture), a unified execution model, and shared memory that supports cache coherence. The DFMC processor can alleviate the memory wall problem by combining a series of cooperative computing techniques of CPEs, such as multi-pattern data stream transfer, efficient register-level communication mechanism, and fast hardware synchronization technique. These techniques are able to improve on-chip data reuse and optimize memory access performance. This paper illustrates an implementation of a full system prototype based on FPGA with four MPEs and 256 CPEs. Our experimental results show that the effect of the cooperative computing techniques of CPEs is significant, with DGEMM (double-precision matrix multiplication) achieving an efficiency of 94%, FFT (fast Fourier transform) obtaining a performance of 207 GFLOPS and FDTD (finite-difference time-domain) obtaining a performance of 27 GFLOPS.

  13. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    Science.gov (United States)

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
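    The relationship between the speedup and efficiency figures quoted above can be checked with a short calculation. This is a minimal sketch that assumes the conventional definition of parallel efficiency (speedup divided by the number of cores); the function name is hypothetical and is not taken from the authors' software.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Parallel efficiency under the usual definition: speedup / cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Figures quoted in the abstract: a speedup of 42 on the 48-core SCC.
efficiency = parallel_efficiency(42, 48)
print(f"efficiency = {efficiency:.3f}")  # 42 / 48, close to the reported 0.9
```

    Under this definition the quoted speedup of 42 on 48 cores gives 0.875, which is consistent with the abstract's rounded value of 0.9.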

  14. Architecture-level performance/power tradeoff in network processor design

    Institute of Scientific and Technical Information of China (English)

    CHEN Hong-song; JI Zhen-zhou; HU Ming-zeng

    2007-01-01

    Network processors are used in the core nodes of a network to flexibly process packet streams. As performance increases, the power consumption of network processors rises quickly, and power and cooling become a bottleneck. Architecture-level power-conscious design must go beyond low-level circuit design: architectural power and performance tradeoffs should be considered at the same time. Simulation is an efficient method for designing a modern network processor before the chip is fabricated. To achieve a tradeoff between performance and power, a processor simulator is used to design the architecture of the network processor. Using the Netbench and Commubench benchmarks and the SimpleScalar processor simulator, the performance and power of the network processor are quantitatively evaluated. A new performance tradeoff evaluation metric is proposed to analyze the architecture of the network processor. Based on the high performance Intel IXP 2800 network processor configuration, optimized instruction fetch width and speed, instruction issue width, and instruction window size are analyzed and selected. Simulation results show that the tradeoff design method makes the network processor more effective to use. The optimal key parameters of the network processor are important in architecture-level design, and the results are meaningful for next generation network processor design.

  15. FY1995 study of design methodology and environment of high-performance processor architectures; 1995 nendo koseino processor architecture sekkeiho to sekkei kankyo no kenkyu

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-03-01

    The aim of our project is to develop high-performance processor architectures for both general-purpose and application-specific use. We also plan to develop basic software, such as compilers, and various design aid tools for those architectures. We are particularly interested in performance evaluation at the architecture design phase, design optimization, automatic generation of compilers from processor designs, and architecture design methodologies combined with circuit layout. We have investigated both microprocessor architectures and design methodologies / environments for the processors. Our goal is to establish design technologies for high-performance, low-power, low-cost and highly reliable systems in the system-on-silicon era. We have proposed the PPRAM architecture for high-performance systems using DRAM and logic mixture technology, the Softcore processor architecture for special purpose processors in embedded systems, and the Power-Pro architecture for low power systems. We also developed design methodologies and design environments for the above architectures as well as a new method for design verification of microprocessors. (NEDO)

  16. Distributed processor allocation for launching applications in a massively connected processors complex

    Science.gov (United States)

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  17. Directions in parallel processor architecture, and GPUs too

    CERN Document Server

    CERN. Geneva

    2014-01-01

    Modern computing is power-limited in every domain. Performance increments extracted from instruction-level parallelism (ILP) are no longer power-efficient; they haven't been for some time. Thread-level parallelism (TLP) is a more easily exploited form of parallelism, at the expense of programmer effort to expose it in the program. In this talk, I will introduce you to disparate topics in parallel processor architecture that will impact programming models (and you) in both the near and far future. About the speaker: Olivier is a senior GPU (SM) architect at NVIDIA and an active participant in the concurrency working group of the ISO C++ committee. He has also worked on very large diesel engines as a mechanical engineer, and taught at McGill University (Canada) as a faculty instructor.

  18. Digital signal processor and its application. ; Animation processing DSP architectures. Digital signal processor to sono oyo. ; Dogazo shoriyo DSP architecture

    Energy Technology Data Exchange (ETDEWEB)

    Murakami, T.; Ohira, H. (Mitsubishi Electric Corp., Tokyo (Japan))

    1991-12-20

    A description is given of the internationally standardized animation coding system, and of existing and next generation image processing digital signal processor (DSP) architectures. The internationally standardized animation coding system stratifies images into segments of picture element, frame, and block, with each stratum given exclusive processing. A TV conference and a TV telephone conversation require a huge amount of animation image data. To process these data on a real-time basis, the current video image processing system takes a multi-DSP configuration. Methods to split the loads and allocate each load fixedly to each DSP are classified into splitting the loads in units of coding processing function, splitting by object images, and splitting according to the amount of load to be calculated. These splitting methods are applied to each stratum processing. The process and splitting corresponding to each stratum processing improved the efficiency. Since the method is software-based, it can be applied not only to the internationally standardized system, but also to the vector quantization system. Although present LSI technology is not sufficiently capable of mounting the architectures meeting the stratified configuration on one chip, an architecture that specializes the functions in each stratum has the possibility to serve as a one-chip DSP. 14 refs., 5 figs., 4 tabs.

  19. A multiple floating point coprocessor architecture

    Energy Technology Data Exchange (ETDEWEB)

    Rauchwerger, L.; Farmwald, M.P. (Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, 305 Talbot Lab., Urbana, IL (US))

    1990-06-01

    General purpose microprocessor based computers usually speed their arithmetic processing performance by using a floating point co-processor. Because adding more co-processors represents neither a technological nor a cost problem the authors investigated a system based on a MIPS R2000 (2) and 4 floating point units. In this paper they show a block diagram of such an implementation and how two important scientific operations can be accelerated using a single unmodified data bus. A large percentage of the engineering applications are solved with the help of linear algebra methods like BLAS3 (4) algorithms; it is precisely for these primitives that the proposed architecture brings significant performance gains. The first operation described is a matrix multiplication algorithm, its timing diagram and some results. Next a polynomial evaluation technique is examined. The authors show how to use the same ideas with various other microprocessors.

  20. DFT algorithms for bit-serial GaAs array processor architectures

    Science.gov (United States)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

  1. Novel CCD image processor for Z-plane architecture

    Science.gov (United States)

    Kemeny, S. E.; Eid, E.-S.; Fossum, E. R.

    1989-09-01

    The use of charge-coupled device (CCD) circuits in Z-plane architectures for focal-plane image processing is discussed. The low-power, compact layout nature of CCDs makes them attractive for Z-plane application. Three application areas are addressed: non-uniformity compensation using CCD MDAC circuits, neighborhood image processing functions implemented with CCD circuits, and the use of CCDs for buffering multiple image frames. Such buffering enables spatial-temporal image transformation for lossless compression.

  2. PVM Enhancement for Beowulf Multiple-Processor Nodes

    Science.gov (United States)

    Springer, Paul

    2006-01-01

    A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.

  3. Design and Implementation of 64-Bit Execute Stage for VLIW Processor Architecture on FPGA

    Directory of Open Access Journals (Sweden)

    Manju Rani

    2012-07-01

    This paper presents an FPGA implementation of a 64-bit execute unit for a VLIW processor with improved power figures. VHDL is used to model the architecture. VLIW stands for Very Long Instruction Word. This processor architecture is based on parallel processing, in which more than one instruction is executed in parallel, and it is used to increase instruction throughput. So this is the base of the modern Superscalar Processors. Basically, VLIW is a RISC processor; the difference is that it contains long instructions as compared to RISC. The execute stage of the pipeline executes the instruction, and it is the stage where the ALU (arithmetic logic unit) is located. The execute stage is synthesized and targeted for a Xilinx Virtex 4 FPGA, and the results calculated for the 64-bit execute stage improve the power as compared to previous work.

  4. Architecture and Design of Medical Processor Units for Medical Networks

    Directory of Open Access Journals (Sweden)

    Syed V. Ahamed

    2010-11-01

    This paper introduces analogical and deductive methodologies for the design of medical processor units (MPUs). From the study of the evolution of numerous earlier processors, we derive the basis for the architecture of MPUs. These specialized processors perform unique medical functions encoded as medical operational codes (mopcs). From a pragmatic perspective, MPUs function very much like CPUs. Both processors have unique operation codes that command the hardware to perform a distinct chain of subprocesses upon operands and generate a specific result unique to the opcode and the operand(s). In medical environments, the MPU decodes the mopcs, executes a series of medical sub-processes, and sends out secondary commands to the medical machine. Whereas operands in a typical computer system are numerical and logical entities, the operands in a medical machine are objects such as patients, blood samples, tissues, operating rooms, medical staff, medical bills, patient payments, etc. We follow the functional overlap between the two processes and evolve the design of medical computer systems and networks.

  5. MGSim - simulation tools for multi-core processor architectures

    NARCIS (Netherlands)

    Lankamp, M.; Poss, R.; Yang, Q.; Fu, J.; Uddin, I.; Jesshope, C.R.

    2013-01-01

    MGSim is an open source discrete event simulator for on-chip hardware components, developed at the University of Amsterdam. It is intended to be a research and teaching vehicle to study the fine-grained hardware/software interactions on many-core and hardware multithreaded processors. It includes su

  6. The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

    CERN Document Server

    Biagioni, Andrea; Lonardo, Alessandro; Paolucci, Pier Stanislao; Perra, Mersia; Rossetti, Davide; Sidore, Carlo; Simula, Francesco; Tosoratto, Laura; Vicini, Piero

    2012-01-01

    One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of providing an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

  7. Tinuso: A processor architecture for a multi-core hardware simulation platform

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; Karlsson, Sven

    2010-01-01

    Multi-core systems have the potential to improve performance, energy and cost properties of embedded systems but also require new design methods and tools to take advantage of the new architectures. Due to the limited accuracy and performance of pure software simulators, we are working on a cycle...... accurate hardware simulation platform. We have developed the Tinuso processor architecture for this platform. Tinuso is a processor architecture optimized for FPGA implementation. The instruction set makes use of predicated instructions and supports C/C++ and assembly language programming. It is designed...... to be easily extendable to maintain the flexibility required for the research on multi-core systems. Tinuso contains a co-processor interface to connect to a network interface. This interface allows for communication over an on-chip network. A clock frequency estimation study on a deeply pipelined Tinuso...

  8. A Reversible Processor Architecture and its Reversible Logic Design

    DEFF Research Database (Denmark)

    Thomsen, Michael Kirkedal; Axelsen, Holger Bock; Glück, Robert

    2012-01-01

    We describe the design of a purely reversible computing architecture, Bob, and its instruction set, BobISA. The special features of the design include a simple, yet expressive, locally-invertible instruction set, and fully reversible control logic and address calculation. We have designed...... an architecture with an ISA that is expressive enough to serve as the target for a compiler from a high-level structured reversible programming language. All-in-all, this paper demonstrates that the design of a complete reversible computing architecture is possible and can serve as the core of a programmable...

  9. Extending and implementing the Self-adaptive Virtual Processor for distributed memory architectures

    NARCIS (Netherlands)

    van Tol, M.W.; Koivisto, J.

    2011-01-01

    Many-core architectures of the future are likely to have distributed memory organizations and need fine grained concurrency management to be used effectively. The Self-adaptive Virtual Processor (SVP) is an abstract concurrent programming model which can provide this, but the model and its current i

  10. Extending and implementing the Self-adaptive Virtual Processor for distributed memory architectures

    NARCIS (Netherlands)

    van Tol, M.W.; Koivisto, J.

    2011-01-01

    Many-core architectures of the future are likely to have distributed memory organizations and need fine grained concurrency management to be used effectively. The Self-adaptive Virtual Processor (SVP) is an abstract concurrent programming model which can provide this, but the model and its current

  11. Reversible machine code and its abstract processor architecture

    DEFF Research Database (Denmark)

    Axelsen, Holger Bock; Glück, Robert; Yokoyama, Tetsuo

    2007-01-01

    A reversible abstract machine architecture and its reversible machine code are presented and formalized. For machine code to be reversible, both the underlying control logic and each instruction must be reversible. A general class of machine instruction sets was proven to be reversible, building...
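    The requirement that each instruction be reversible can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the four-instruction set, the 32-bit register width, and all names below are hypothetical and are not the instruction set formalized in the paper. It covers straight-line code only and does not model reversible control logic.

```python
MOD = 2 ** 32  # assumed register width for this toy machine

# Each hypothetical instruction is paired with the instruction that undoes it.
INVERSE = {"ADD": "SUB", "SUB": "ADD", "XOR": "XOR", "SWAP": "SWAP"}


def step(regs, instr):
    """Apply one instruction (op, a, b) to a copy of the register list."""
    op, a, b = instr
    if op != "SWAP" and a == b:
        # ADD r, r or XOR r, r would destroy information, so reject it.
        raise ValueError("source and destination must differ")
    regs = list(regs)
    if op == "ADD":
        regs[a] = (regs[a] + regs[b]) % MOD
    elif op == "SUB":
        regs[a] = (regs[a] - regs[b]) % MOD
    elif op == "XOR":
        regs[a] ^= regs[b]
    elif op == "SWAP":
        regs[a], regs[b] = regs[b], regs[a]
    else:
        raise ValueError(f"unknown instruction {op}")
    return regs


def run(regs, program):
    """Run a straight-line program from top to bottom."""
    for instr in program:
        regs = step(regs, instr)
    return regs


def invert(program):
    """Inverse program: inverse instructions in reverse order."""
    return [(INVERSE[op], a, b) for op, a, b in reversed(program)]


program = [("ADD", 0, 1), ("XOR", 2, 0), ("SWAP", 1, 2), ("SUB", 1, 0)]
start = [5, 7, 11]
end = run(start, program)
assert run(end, invert(program)) == start  # the computation can be undone
```

    Because every instruction updates one register as an injective function of its old value, running the inverted program restores the initial state without any saved history.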

  12. Stagnant Timing investigation of Embedded Software on Advanced Processor Architectures

    Directory of Open Access Journals (Sweden)

    M.Shankar

    2012-01-01

    Most processors today are embedded in products like mobile phones, microwave ovens, welding machines, etc., and are not used in PCs as many believe. Since some of these embedded computers are used in time-critical or safety-critical systems, it is very important that the behaviour of these systems is well known. One part of that is to know the Worst Case Execution Time (WCET) of the different tasks in the embedded system. First, shortcomings in current as well as future standards for controlling the power grid are outlined. From these economic and safety threats, we derive an immediate need to invest in research on the protection of the power grid, both from the perspective of cyber attacks and distributed control system problems. Second, current software design practice does not adequately verify and validate worst-case timing scenarios that have to be guaranteed in order to meet deadlines in safety-critical embedded systems. This equally applies to avionics and the automotive industry, both of which are increasingly requiring their suppliers to provide variable bounds on worst-case execution time of software.

  13. Scalable Architecture for a Room Temperature Solid-State Quantum Information Processor

    CERN Document Server

    Yao, Norman Y; Gorshkov, Alexey V; Maurer, Peter C; Giedke, Geza; Cirac, J Ignacio; Lukin, Mikhail D

    2010-01-01

    The realization of a scalable quantum information processor has emerged over the past decade as one of the central challenges at the interface of fundamental science and engineering. Much progress has been made towards this goal. Indeed, quantum operations have been demonstrated on several trapped ion qubits, and other solid-state systems are approaching similar levels of control. Extending these techniques to achieve fault-tolerant operations in larger systems with more qubits remains an extremely challenging goal, in part, due to the substantial technical complexity of current implementations. Here, we propose and analyze an architecture for a scalable, solid-state quantum information processor capable of operating at or near room temperature. The architecture is applicable to realistic conditions, which include disorder and relevant decoherence mechanisms, and includes a hierarchy of control at successive length scales. Our approach is based upon recent experimental advances involving Nitrogen-Vacancy colo...

  14. An FFT Performance Model for Optimizing General-Purpose Processor Architecture

    Institute of Scientific and Technical Information of China (English)

    Ling Li; Yun-Ji Chen; Dao-Fu Liu; Cheng Qian; Wei-Wu Hu

    2011-01-01

    The general-purpose processor (GPP) is an important platform for the fast Fourier transform (FFT), due to its flexibility, reliability and practicality. Because FFT is a representative application intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits the performance of many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP, and guides the architecture optimization for GPPs as well. Based on the model, two theorems on the optimization of architecture parameters are deduced, which refer to the lower bounds of register number and memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model. The above investigations were adopted in the development of Godson-3B, which is an industrial GPP. The optimization techniques deduced from our performance model improve the FFT performance by about 40%, while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 Watt power consumption, and has the highest performance-per-watt in complex FFT among processors as far as we know. This work could benefit optimization of other GPPs as well.
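    The runtime and power figures quoted above can be turned into a rough throughput estimate. This is a minimal sketch that assumes the common benchmarking convention of counting 5 N log2 N floating-point operations for an N-point complex FFT; the abstract does not state which operation count the authors use, and the function names are hypothetical.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for an n-point complex FFT (assumed 5 * n * log2(n))."""
    return 5 * n * math.log2(n)


def gflops(n: int, runtime_s: float) -> float:
    """Nominal throughput in GFLOPS for one n-point FFT taking runtime_s seconds."""
    return fft_flops(n) / runtime_s / 1e9


# Figures quoted in the abstract: 1024 points, 0.368 microseconds, about 40 W.
throughput = gflops(1024, 0.368e-6)
per_watt = throughput / 40.0
print(f"{throughput:.1f} GFLOPS, {per_watt:.2f} GFLOPS/W")
```

    Under this convention the quoted figures correspond to roughly 139 GFLOPS and about 3.5 GFLOPS per watt; a different operation count would change both numbers proportionally.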

  15. GASP-PL/I Simulation of Integrated Avionic System Processor Architectures. M.S. Thesis

    Science.gov (United States)

    Brent, G. A.

    1978-01-01

    A development study sponsored by NASA was completed in July 1977 which proposed a complete integration of all aircraft instrumentation into a single modular system. Instead of using the current single-function aircraft instruments, computers compiled and displayed inflight information for the pilot. A processor architecture called the Team Architecture was proposed. This is a hardware/software approach to high-reliability computer systems. A follow-up study of the proposed Team Architecture is reported. GASP-PL/I simulation models are used to evaluate the operating characteristics of the Team Architecture. The problem, model development, simulation programs and results are presented at length. Also included are program input formats, outputs and listings.

  16. Monte Carlo simulations on SIMD computer architectures. [Single instruction multiple data (SIMD)

    Energy Technology Data Exchange (ETDEWEB)

    Burmester, C.P.; Gronsky, R. (Lawrence Berkeley Lab., CA (United States)); Wille, L.T. (Florida Atlantic Univ., Boca Raton, FL (United States). Dept. of Physics)

    1992-03-01

    Algorithmic considerations regarding the implementation of various materials science applications of the Monte Carlo technique on single instruction multiple data (SIMD) computer architectures are presented. In particular, implementation of the Ising model with nearest, next nearest, and long range screened Coulomb interactions on the SIMD architecture MasPar MP-1 (DEC mpp-12000) series of massively parallel computers is demonstrated. Methods of code development which optimize processor array use and minimize inter-processor communication are presented including lattice partitioning and the use of processor array spanning tree structures for data reduction. Both geometric and algorithmic parallel approaches are utilized. Benchmarks in terms of Monte Carlo updates per second for the MasPar architecture are presented and compared to values reported in the literature from comparable studies on other architectures.
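    One common form of lattice partitioning for the nearest-neighbour Ising model is the checkerboard decomposition, sketched below. This is an illustrative assumption, not the authors' MasPar code: the abstract does not say which partitioning is used, the sketch covers nearest-neighbour coupling only (J = 1, periodic boundaries, even lattice size), and it runs serially in plain Python.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep over an L x L periodic Ising lattice (L even)."""
    size = len(spins)
    for colour in (0, 1):
        # Sites of one colour only have neighbours of the other colour, so all
        # updates within this pass are independent and could run in lockstep
        # on a SIMD processor array.
        for i in range(size):
            for j in range(size):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (
                    spins[(i - 1) % size][j] + spins[(i + 1) % size][j]
                    + spins[i][(j - 1) % size] + spins[i][(j + 1) % size]
                )
                # Energy change if this spin were flipped.
                delta_e = 2 * spins[i][j] * neighbours
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]
    return spins


rng = random.Random(0)
lattice = [[rng.choice((-1, 1)) for _ in range(8)] for _ in range(8)]
for _ in range(10):
    checkerboard_sweep(lattice, beta=0.5, rng=rng)
```

    The two-colour split is what removes data dependencies between simultaneously updated sites; longer-range interactions such as those mentioned in the abstract would need a finer partition.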

  17. DATA BYPASSING ARCHITECTURE AND CIRCUIT DESIGN FOR 32-BIT DIGITAL SIGNAL PROCESSOR

    Institute of Scientific and Technical Information of China (English)

    Chen Xiaoyi; Yao Qingdong; Liu Peng

    2005-01-01

    This paper presents a design method for the ByPassing Unit (BPU) in the 32-bit Digital Signal Processor (DSP) MD32. MD32 is realized in 0.18 μm technology with a 1.8 V supply and a 200 MHz working clock. It combines the Reduced Instruction Set Computer (RISC) architecture with DSP computation capability, and extends the DSP with various addressing modes in a customized DSP pipeline stage architecture. The paper also discusses the architecture and circuit design of bypassing logic to fit the MD32 architecture. At the architecture level, the BPU executes in parallel with instruction decode to reduce time delay. The optimization of the circuit that performs serial selection with priority is analyzed in detail, and the result shows that about half of the time delay is removed by this optimization. Examples show that the BPU is useful for improving the DSP's performance. The forwarding logic in MD32 realizes feedback over 8 data channels and meets the working clock limit.
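    The behaviour of bypassing (forwarding) logic with priority selection can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the MD32 circuit: the names `InFlight` and `read_operand` are hypothetical, and the model simply gives the youngest in-flight result priority over older ones and over the register file.

```python
from typing import NamedTuple, Optional, Sequence


class InFlight(NamedTuple):
    """A result still in the pipeline and not yet written back."""
    dest: Optional[int]  # destination register, or None if nothing is written
    value: int


def read_operand(reg: int, regfile: Sequence[int],
                 stages: Sequence[InFlight]) -> int:
    """Return the freshest value of `reg`.

    `stages` is ordered youngest first, so the first match wins. This mirrors
    a priority select: a newer producer overrides an older one, and the
    register file is used only when no in-flight result targets `reg`.
    """
    for stage in stages:
        if stage.dest == reg:
            return stage.value
    return regfile[reg]


regfile = [0, 10, 20, 30]
pipeline = [InFlight(2, 99), InFlight(2, 77), InFlight(None, 0)]
print(read_operand(2, regfile, pipeline))  # forwarded from the youngest producer
print(read_operand(1, regfile, pipeline))  # no producer in flight: register file
```

    In hardware the loop becomes a chain of comparators and multiplexers, which is why the depth of the priority chain affects the critical path.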

  18. Server-Based Data Push Architecture for Multi-Processor Environments

    Institute of Scientific and Technical Information of China (English)

    Xian-He Sun; Surendra Byna; Yong Chen

    2007-01-01

    Data access delay is a major bottleneck in utilizing current high-end computing (HEC) machines. Prefetching, where data is fetched before the CPU demands it, has been considered an effective solution for masking data access delay. However, current client-initiated prefetching strategies, where a computing processor initiates prefetching instructions, have many limitations. They do not work well for applications with complex, non-contiguous data access patterns. While technology advances continue to increase the gap between computing and data access performance, trading computing power for reducing data access delay has become a natural choice. In this paper, we present a server-based data-push approach and discuss its associated implementation mechanisms. In the server-push architecture, a dedicated server called Data Push Server (DPS) initiates and proactively pushes data closer to the client in time. Issues, such as what data to fetch, when to fetch, and how to push are studied. The SimpleScalar simulator is modified with a dedicated prefetching engine that pushes data for another processor to test DPS based prefetching. Simulation results show that L1 Cache miss rate can be reduced by up to 97% (71% on average) over a superscalar processor for SPEC CPU2000 benchmarks that have high cache miss rates.

  19. Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing

    Science.gov (United States)

    Sterling, T. L.; Zima, H. P.

    2002-01-01

    Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.

  20. Heterogeneous reconfigurable processors for real-time baseband processing from algorithm to architecture

    CERN Document Server

    Zhang, Chenxin; Öwall, Viktor

    2016-01-01

    This book focuses on domain-specific heterogeneous reconfigurable architectures, demonstrating for readers a computing platform which is flexible enough to support multiple standards, multiple modes, and multiple algorithms. The content is multi-disciplinary, covering areas of wireless communication, computing architecture, and circuit design. The platform described provides real-time processing capability with reasonable implementation cost, achieving balanced trade-offs among flexibility, performance, and hardware costs. The authors discuss efficient design methods for wireless communication processing platforms, from both an algorithm and architecture design perspective. Coverage also includes computing platforms for different wireless technologies and standards, including MIMO, OFDM, Massive MIMO, DVB, WLAN, LTE/LTE-A, and 5G. •Discusses reconfigurable architectures, including hardware building blocks such as processing elements, memory sub-systems, Network-on-Chip (NoC), and dynamic hardware reconfigur...

  1. Design of highly efficient elliptic curve crypto-processor with two multiplications over GF(2^163)

    Institute of Scientific and Technical Information of China (English)

    DAN Yong-ping; ZOU Xue-cheng; LIU Zheng-lin; HAN Yu; YI Li-hua

    2009-01-01

    In this article, a parallel hardware processor is presented to compute elliptic curve scalar multiplication in polynomial basis representation. The processor is applicable to the operations of scalar multiplication by using a modular arithmetic logic unit (MALU). The MALU consists of two multiplications, one addition, and one squaring. The two multiplications and the addition or squaring can be computed in parallel. The whole computations of scalar multiplication over GF(2^163) can be performed in 3 064 cycles. The simulation results based on Xilinx Virtex2 XC2V6000 FPGAs show that the proposed design can compute random GF(2^163) elliptic curve scalar multiplication operations in 31.17 μs, and the resource occupies 3 994 registers and 15 527 LUTs, which indicates that the crypto-processor is suitable for high-performance application.
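
    The field operations that such a MALU performs (addition, multiplication, squaring in polynomial basis) can be sketched in software. The snippet below is only an illustrative sketch, not the hardware design from the record: it assumes the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1, which the abstract does not name, and the function names are hypothetical.

```python
# Software sketch of GF(2^163) arithmetic in polynomial basis.
# A field element is a Python int whose bit i is the coefficient of x^i.
M = 163

# Assumption: the NIST reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
# The abstract does not say which irreducible polynomial the design uses.
REDUCTION_POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^m) is coefficient-wise XOR (no carries)."""
    return a ^ b


def gf_reduce(p: int) -> int:
    """Reduce a polynomial of any degree modulo REDUCTION_POLY."""
    while p.bit_length() > M:
        # Align the reduction polynomial with the current top bit and cancel it.
        shift = p.bit_length() - 1 - M
        p ^= REDUCTION_POLY << shift
    return p


def gf_mul(a: int, b: int) -> int:
    """Carry-less shift-and-add multiplication followed by reduction."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        b >>= 1
    return gf_reduce(product)


def gf_square(a: int) -> int:
    """Squaring, written here as a plain multiplication for clarity."""
    return gf_mul(a, a)
```

    In hardware, squaring is usually much cheaper than a general multiplication, which is one reason a unit might pair it with the addition while the two multipliers run in parallel; the sketch above makes no attempt to model that scheduling.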

  2. The application of compiler-assisted multiple instruction retry to VLIW architectures

    Science.gov (United States)

    Chen, Shyh-Kwei; Fuchs, W. K.; Hwu, Wen-Mei W.

    1994-01-01

    Very Long Instruction Word (VLIW) architectures enhance performance by exploiting fine-grained instruction level parallelism. We describe the development of two compiler assisted multiple instruction word retry schemes for VLIW architectures. The first scheme utilizes the compiler techniques previously developed for processors with single functional units. A compiler generated hazard-free code with different degrees of rollback capability for uniprocessors is compacted by a modified VLIW trace scheduling algorithm. Nops are then inserted in the scheduled code words to resolve data hazards for VLIW architectures. Performance is compared under three parameters: the rollback distance for uni-processors; the number of functional units; and the rollback distance for VLIW architectures. The second scheme employs a hardware read buffer to resolve frequently occurring data hazards, and utilizes the compiler to resolve the remaining hazards. Performance results are shown for six benchmark programs.

  3. Architecture-Aware Session Lookup Design for Inline Deep Inspection on Network Processors

    Institute of Scientific and Technical Information of China (English)

    XU Bo; HE Fei; XUE Yibo; LI Jun

    2009-01-01

    Today's firewalls and security gateways are required to not only block unauthorized accesses by authenticating packet headers, but also inspect flow payloads against malicious intrusions. Deep inspection emerges as a seamless integration of packet classification for access control and pattern matching for intrusion prevention. The two function blocks are linked together via well-designed session lookup schemes. This paper presents an architecture-aware session lookup scheme for deep inspection on network processors (NPs). Test results show that the proposed session data structure and integration approach can achieve the OC-48 line rate (2.5 Gbps) with inline stateful content inspection on the Intel IXP2850 NP. This work provides an insight into application design and implementation on NPs and principles for performance tuning of NP-based programming such as data allocation, task partitioning, latency hiding, and thread synchronization.

  4. ESAIR: A Behavior-Based Robotic Software Architecture on Multi-Core Processor Platforms

    Directory of Open Access Journals (Sweden)

    Chin-Yuan Tseng

    2013-03-01

    This paper introduces an Embedded Software Architecture for Intelligent Robot systems (ESAIR) that addresses the issues of parallel thread executions on multi-core processor platforms. ESAIR provides a thread scheduling interface to improve the execution performance of a robot system by assigning a dedicated core to a running thread on the fly and dynamically rescheduling the priority of the thread. In the paper, we describe the object-oriented design and the control functions of ESAIR. The modular design of ESAIR helps improve the software quality, reliability and scalability in research and real practice. We prove the improvement by realizing ESAIR on an autonomous robot, named AVATAR. AVATAR implements various human-robot interactions, including speech recognition, human following, face recognition, speaker identification, etc. With the support of ESAIR, AVATAR can integrate a comprehensive set of behaviors and peripherals with better resource utilization.

  5. Evaluation of Dual-Launch Lunar Architectures Using the Mission Assessment Post Processor

    Science.gov (United States)

    Stewart, Shaun M.; Senent, Juan; Williams, Jacob; Condon, Gerald L.; Lee, David E.

    2010-01-01

    The National Aeronautics and Space Administration's (NASA) Constellation Program is currently designing a new transportation system to replace the Space Shuttle, support human missions to both the International Space Station (ISS) and the Moon, and enable the eventual establishment of an outpost on the lunar surface. The present Constellation architecture is designed to meet nominal capability requirements and provide flexibility sufficient for handling a host of contingency scenarios including (but not limited to) launch delays at the Earth. This report summarizes a body of work performed in support of the Review of U.S. Human Space Flight Committee. It analyzes three lunar orbit rendezvous dual-launch architecture options which incorporate differing methodologies for mitigating the effects of launch delays at the Earth. NASA employed the recently-developed Mission Assessment Post Processor (MAPP) tool to quickly evaluate vehicle performance requirements for several candidate approaches for conducting human missions to the Moon. The MAPP tool enabled analysis of Earth perturbation effects and Earth-Moon geometry effects on the integrated vehicle performance as it varies over the 18.6-year lunar nodal cycle. Results are provided summarizing best-case and worst-case vehicle propellant requirements for each architecture option. Additionally, the associated vehicle payload mass requirements at launch are compared between each architecture and against those of the Constellation Program. The current Constellation Program architecture assumes that the Altair lunar lander and Earth Departure Stage (EDS) vehicles are launched on a heavy lift launch vehicle. The Orion Crew Exploration Vehicle (CEV) is separately launched on a smaller man-rated vehicle. This strategy relaxes man-rating requirements for the heavy lift launch vehicle and has the potential to significantly reduce the cost of the overall architecture over the operational lifetime of the program. The crew launch

  6. Multiple-Morphs Adaptive Stream Architecture

    Institute of Scientific and Technical Information of China (English)

    Mei Wen; Nan Wu; Hai-Yan Li; Chun-Yuan Zhang

    2005-01-01

    In modern VLSI technology, hundreds of thousands of arithmetic units fit on a 1 cm² chip. The challenge is supplying them with instructions and data. Stream architecture is able to solve the problem well. However, the applications suited for typical stream architecture are limited. This paper presents the definition of regular stream and irregular stream, and then describes MASA (Multiple-morphs Adaptive Stream Architecture) prototype system which supports different execution models according to applications' stream characteristics. This paper first discusses MASA architecture and stream model, and then explores the features and advantages of MASA through mapping stream applications to hardware. Finally MASA is evaluated by ten benchmarks. The result is encouraging.

  7. A novel reconfigurable optical interconnect architecture using an Opto-VLSI processor and a 4-f imaging system.

    Science.gov (United States)

    Shen, Mingya; Xiao, Feng; Alameh, Kamal

    2009-12-07

    A novel reconfigurable optical interconnect architecture for on-board high-speed data transmission is proposed and experimentally demonstrated. The interconnect architecture is based on the use of an Opto-VLSI processor in conjunction with a 4-f imaging system to achieve reconfigurable chip-to-chip or board-to-board data communications. By reconfiguring the phase hologram of an Opto-VLSI processor, optical data generated by a Vertical Cavity Surface Emitting Laser (VCSEL) associated with a chip (or a board) is arbitrarily steered to the photodetector associated with another chip (or another board). Experimental results show that the optical interconnect losses range from 5.8dB to 9.6dB, and that the maximum crosstalk level is below -36dB. The proposed architecture is tested for high-speed data transmission, and measured eye diagrams display good eye opening for data rates of up to 10Gb/s.

  8. Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors

    DEFF Research Database (Denmark)

    Liu, Weifeng; Vinter, Brian

    2015-01-01

    Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over...
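
    For readers unfamiliar with the building block, the following is a minimal sequential sketch of CSR-format SpMV expressed as a segmented sum: one product per stored nonzero, then one sum per row segment. It does not model the speculative GPU pass or the CPU re-arrangement step described in the record; the function name and argument names are hypothetical.

```python
def spmv_csr_segmented(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    # Step 1: one elementwise product per stored nonzero.
    products = [v * x[c] for v, c in zip(values, col_idx)]

    # Step 2: segmented sum, where each row is one segment of `products`.
    # Empty rows have row_ptr[i] == row_ptr[i + 1] and sum to 0.
    return [
        sum(products[row_ptr[i]:row_ptr[i + 1]])
        for i in range(len(row_ptr) - 1)
    ]


# Example: A = [[1, 0, 2],
#               [0, 0, 0],
#               [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1, 2, 3, 4]
y = spmv_csr_segmented(row_ptr, col_idx, values, [1, 2, 3])
```

    Step 1 is trivially data-parallel; the difficulty on parallel hardware lies in step 2, because row segments have irregular lengths and empty rows must still produce a zero.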

  9. Scheduling Algorithm: Tasks Scheduling Algorithm for Multiple Processors with Dynamic Reassignment

    Directory of Open Access Journals (Sweden)

    Pradeep Kumar Yadav

    2008-01-01

    Distributed computing systems [DCSs] offer the potential for improved performance and resource sharing. To make the best use of the computational power available, it is essential to assign tasks dynamically to the processor whose characteristics are most appropriate for their execution in a distributed processing system. We have developed a mathematical model for allocating “M” tasks of a distributed program to “N” multiple processors (M>N) that minimizes the total cost of the program. The relocation of tasks from one processor to another at certain points during the course of execution, which contributes to the total cost of the running program, has been taken into account. Phasewise execution cost [EC], intertask communication cost [ITCT], residence cost [RC] of each task on different processors, and relocation cost [REC] for each task have been considered while preparing a dynamic tasks allocation model. The present model is suitable for an arbitrary number of phases and processors with random program structure.
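
    To make the cost structure concrete, here is a small exhaustive-search sketch of phase-wise allocation with relocation. It is an illustration under simplifying assumptions, not the authors' mathematical model: residence cost is omitted, communication cost is charged only when two tasks sit on different processors, relocation cost is a flat per-task fee, and all names are hypothetical.

```python
from itertools import product


def phase_cost(assign, exec_cost, comm_cost):
    """Cost of one phase: execution plus communication across processors.

    assign[t] is the processor of task t, exec_cost[t][p] the execution
    cost of task t on processor p, comm_cost[t1][t2] the communication
    cost paid when tasks t1 and t2 run on different processors.
    """
    cost = sum(exec_cost[t][p] for t, p in enumerate(assign))
    for t1 in range(len(assign)):
        for t2 in range(t1 + 1, len(assign)):
            if assign[t1] != assign[t2]:
                cost += comm_cost[t1][t2]
    return cost


def min_total_cost(exec_costs, comm_costs, reloc_cost, n_proc):
    """Minimum total cost over all phases, by dynamic programming.

    exec_costs[k] and comm_costs[k] describe phase k; reloc_cost[t] is
    charged whenever task t changes processor between two phases.
    The state space has n_proc ** n_tasks entries, so this only suits
    toy problem sizes.
    """
    n_tasks = len(reloc_cost)
    states = list(product(range(n_proc), repeat=n_tasks))
    best = {s: phase_cost(s, exec_costs[0], comm_costs[0]) for s in states}
    for k in range(1, len(exec_costs)):
        new_best = {}
        for s in states:
            here = phase_cost(s, exec_costs[k], comm_costs[k])
            new_best[s] = here + min(
                best[prev]
                + sum(reloc_cost[t] for t in range(n_tasks) if prev[t] != s[t])
                for prev in states
            )
        best = new_best
    return min(best.values())


# Three tasks, two processors, two phases (M > N as in the abstract).
comm = [[0, 10, 10], [10, 0, 10], [10, 10, 0]]
phase0_exec = [[1, 5], [1, 5], [1, 5]]  # processor 0 is cheap in phase 0
phase1_exec = [[5, 1], [5, 1], [5, 1]]  # processor 1 is cheap in phase 1
cheap_move = min_total_cost([phase0_exec, phase1_exec], [comm, comm], [3, 3, 3], 2)
costly_move = min_total_cost([phase0_exec, phase1_exec], [comm, comm], [7, 7, 7], 2)
```

    With a relocation fee of 3 per task, moving all three tasks between phases is worthwhile; with a fee of 7 the cheapest plan keeps every task in place, which illustrates why relocation cost has to be part of the objective.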

  10. Multiple directed graph large-class multi-spectral processor

    Science.gov (United States)

    Casasent, David; Liu, Shiaw-Dong; Yoneyama, Hideyuki

    1988-01-01

    Numerical analysis techniques for the interpretation of high-resolution imaging-spectrometer data are described and demonstrated. The method proposed involves the use of (1) a hierarchical classifier with a tree structure generated automatically by a Fisher linear-discriminant-function algorithm and (2) a novel multiple-directed-graph scheme which reduces the local maxima and the number of perturbations required. Results for a 500-class test problem involving simulated imaging-spectrometer data are presented in tables and graphs; 100-percent-correct classification is achieved with an improvement factor of 5.

  11. SAD PROCESSOR FOR MULTIPLE MACROBLOCK MATCHING IN FAST SEARCH VIDEO MOTION ESTIMATION

    Directory of Open Access Journals (Sweden)

    Nehal N. Shah

    2015-02-01

    Motion estimation is a very important but computationally complex task in video coding. The process of determining motion vectors based on the temporal correlation of consecutive frames is used for video compression. In order to reduce the computational complexity of motion estimation and maintain the quality of encoding during motion compensation, different fast search techniques are available. These block-based motion estimation algorithms use the sum of absolute differences (SAD) between the corresponding macroblock in the current frame and all the candidate macroblocks in the reference frame to identify the best match. Existing implementations can perform SAD between two blocks using a sequential or pipelined approach, but performing multi-operand SAD in a single clock cycle with optimized resources is state of the art. In this paper, various parallel architectures for computation of the fixed-block-size SAD are evaluated and a fast parallel SAD architecture is proposed with optimized resources. Further, a SAD processor is described with 9 processing elements which can be configured for any existing fast search block matching algorithm. The proposed SAD processor consumes 7% fewer adders compared to the existing implementation for one processing element. Using nine PEs, it can process 84 HD frames per second in the worst case, which is a good outcome for real-time implementation. In the average case, the architecture processes 325 HD frames per second.
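
    The matching criterion itself is simple to state in code. The sketch below computes SAD between a current block and a handful of candidate positions in a reference frame and returns the best match; it is a plain software illustration, not the parallel hardware architecture of the record, and the candidate list stands in for whatever pattern a fast search algorithm would supply. All names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Cut a size x size block out of a frame given as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current_block, reference, candidates, size):
    """Return (sad_value, (top, left)) for the candidate with minimum SAD.

    candidates is a list of (top, left) positions in the reference frame;
    a fast search algorithm would generate them step by step.
    """
    return min(
        (sad(current_block, extract_block(reference, top, left, size)), (top, left))
        for top, left in candidates
    )


reference = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
current = [[7, 8], [11, 12]]
result = best_match(current, reference, [(0, 0), (1, 1), (1, 2), (2, 2)], size=2)
```

    Each candidate evaluation is independent of the others, which is what makes it natural to assign candidates to separate processing elements.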

  12. An Architectural Approach for Decoding and Distributing Functions in FPUs in a Functional Processor System

    CERN Document Server

    Nair, T R Gopalakrishnan; Krutthika, H K

    2010-01-01

    The main goal of this research is to develop the concepts of a revolutionary processor system called Functional Processor System. The fairly novel work carried out in this proposal concentrates on decoding of function pipelines and distributing them in FPUs as a part of the scheduling approach. As the functional programs are super-level programs that entail requirements only at the functional level, decoding of functions and distribution of functions in the heterogeneous functional processor units are a challenge. We explored the possibilities of segregation of the functions from the application program and distributing the functions on the relevant FPUs by using address mapping techniques. Here we pursue the perception of feeding the functions into the processor farm rather than the processor fetching the instructions or functions and executing them. This work is carried out at theoretical levels and it requires a long way to go in the realization of this work in hardware perhaps with a large industrial team with a pra...

  13. Concept of a Supervector Processor: A Vector Approach to Superscalar Processor, Design and Performance Analysis

    Directory of Open Access Journals (Sweden)

    Deepak Kumar, Ranjan Kumar Behera, K. S. Pandey

    2013-07-01

    Maximizing the available performance is always a goal in microprocessor design. In this paper a new technique has been implemented which exploits the advantages of both superscalar and vector processing techniques in a proposed processor called the Supervector processor. A vector processor operates on an array of data called a vector and can greatly improve certain tasks, such as numerical simulation and tasks which require heavy number crunching. On the other hand, a superscalar processor issues multiple instructions per cycle, which can enhance the throughput. To implement parallelism, multiple vector instructions were issued and executed per cycle in superscalar fashion. A case study has been done on various benchmarks to compare the performance of the proposed supervector processor architecture with superscalar and vector processor architectures. The Trimaran Framework has been used in order to evaluate the performance of the proposed supervector processor scheme.

  14. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    OpenAIRE

    Kaufmann Michael; Nieselt Kay; Schmollinger Martin; Morgenstern Burkhard

    2004-01-01

    Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pa...

  15. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
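
    Strategy (a) relies only on the independence of the pairwise jobs, which can be sketched as follows. This is not DIALIGN code: `toy_pair_score` is a hypothetical stand-in for a real pairwise alignment, and a thread pool is used only to keep the sketch short. For CPU-bound work in CPython a process pool would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def toy_pair_score(pair):
    """Placeholder for a pairwise alignment: counts identical positions."""
    a, b = pair
    return sum(1 for x, y in zip(a, b) if x == y)


def all_pair_scores(sequences, workers=4):
    """Score every unordered pair of sequences, distributing pairs to workers.

    Returns a dict mapping (i, j) index pairs to scores. Because the
    pairs are independent, the result does not depend on `workers`.
    """
    pairs = list(combinations(range(len(sequences)), 2))
    jobs = [(sequences[i], sequences[j]) for i, j in pairs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the jobs were submitted.
        scores = list(pool.map(toy_pair_score, jobs))
    return dict(zip(pairs, scores))


scores = all_pair_scores(["ACGT", "ACGA", "TCGT"])
```

    The number of pairs grows quadratically with the number of sequences, so this stage tends to dominate and benefits most from being spread over many workers.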

  16. Knowledge Framework Implementation with Multiple Architectures - 13090

    Energy Technology Data Exchange (ETDEWEB)

    Upadhyay, H.; Lagos, L.; Quintero, W.; Shoffner, P. [Applied Research Center, Florida International University, Miami, FL 33174 (United States); DeGregory, J. [Office of D and D and Facility Engineering, Environmental Management, Department of Energy (United States)

    2013-07-01

    Multiple kinds of knowledge management systems are operational in public and private enterprises, large and small organizations with a variety of business models that make the design, implementation and operation of integrated knowledge systems very difficult. In recent days, there has been a sweeping advancement in the information technology area, leading to the development of sophisticated frameworks and architectures. These platforms need to be used for the development of integrated knowledge management systems which provide a common platform for sharing knowledge across the enterprise, thereby reducing the operational inefficiencies and delivering cost savings. This paper discusses the knowledge framework and architecture that can be used for the system development and its application to the real-life needs of the nuclear industry. A case study of deactivation and decommissioning (D and D) is discussed with the Knowledge Management Information Tool platform and framework. D and D work is a high priority activity across the Department of Energy (DOE) complex. Subject matter specialists (SMS) associated with DOE sites, the Energy Facility Contractors Group (EFCOG) and the D and D community have gained extensive knowledge and experience over the years in the cleanup of the legacy waste from the Manhattan Project. To prevent the D and D knowledge and expertise from being lost over time from the evolving and aging workforce, DOE and the Applied Research Center (ARC) at Florida International University (FIU) proposed to capture and maintain this valuable information in a universally available and easily usable system. (authors)

  17. Architectural considerations in the design of a superconducting quantum annealing processor

    OpenAIRE

    Bunyk, P. I.; Hoskinson, E.; Johnson, M. W.; Tolkacheva, E.; Altomare, F.; Berkley, A. J.; Harris, R; Hilton, J. P.; Lanting, T.; Whittaker, J

    2014-01-01

    We have developed a quantum annealing processor, based on an array of tunably coupled rf-SQUID flux qubits, fabricated in a superconducting integrated circuit process [1]. Implementing this type of processor at a scale of 512 qubits and 1472 programmable inter-qubit couplers and operating at ~ 20 mK has required attention to a number of considerations that one may ignore at the smaller scale of a few dozen or so devices. Here we discuss some of these considerations, and the delicate balance n...

  18. A soft-core processor architecture optimised for radar signal processing applications

    CSIR Research Space (South Africa)

    Broich, R

    2013-12-01

    Architecture ... 4.4 Datapath Architecture ... 4.5 Control Unit Architecture...View for FPGA, MATLAB/Simulink plug-ins: Xilinx System Generator, Altera DSP Builder) Graphical interconnection of different DSP blocks, multi-rate systems, fused datapath optimisations. Difficult to describe and debug complex designs, no design trade...

  19. Design Methodology for Multiple Microcomputer Architectures.

    Science.gov (United States)

    1982-07-01

    multimicro design knowledge is true both in industry and in university environments. In the industrial environment, it reduces productivity and increases...Real-Time Processor Problems," Proc. of ELECTRO-81 Tercer Seminario de Ingenieria Electronica , Nov. 9-13, 1981. 14 1981 "D Flip/Flop Substracts

  20. An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications.

    Science.gov (United States)

    Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun

    2015-12-01

    Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition and action recognition because of its best-in-class recognition accuracy compared to hand-crafted algorithm and shallow learning based algorithms. Long learning time caused by its complex structure, however, limits its usage only in high-cost servers or many-core GPU platforms so far. On the other hand, the demand on customized pattern recognition within personal devices will grow gradually as more deep learning applications will be developed. This paper presents a SoC implementation to enable deep learning applications to run with low cost platforms such as mobile or portable devices. Different from conventional works which have adopted massively-parallel architecture, this work adopts task-flexible architecture and exploits multiple parallelism to cover complex functions of convolutional deep belief network which is one of popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable system. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.
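
    The reported energy efficiency appears consistent with dividing the peak performance by the peak power; this reading is an inference from the numbers, not a statement from the record:

```latex
\[
\frac{411.3\ \text{GOPS}}{213.1\ \text{mW}}
  = \frac{411.3\ \text{GOPS}}{0.2131\ \text{W}}
  \approx 1930\ \text{GOPS/W}
  \approx 1.93\ \text{TOPS/W}
\]
```

    Using the 185 mW average power instead would give roughly 2.22 TOPS/W, which does not match the quoted figure.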

  1. Handling Multiple Ecologies in Architectural Design

    DEFF Research Database (Denmark)

    Lotz, Katrine; Sattrup, Peter Andreas

    2014-01-01

    In light of the many challenges of resource scarcity, climate change, rapid urbanization and changing social patterns facing societies today, mainstream architecture remains remarkably 'resilient' to conceptual innovation regarding its nature and role in society. If the idea of open architecture, able to accommodate change over time is a necessary development in architectural conceptualization, what are the barriers and problems inherent in the present design culture, and how may these be overcome? In an educational experiment, architecture and architectural engineering students were asked to imagine that their recent housing projects had been built and occupied for 25+ years. The students were given the task to transform each other's projects according to new social programs, increased urban density and strict energy and resources use paradigms, using a design methodological framework...

  2. Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

    Science.gov (United States)

    Djomehri, M. Jahed; Biswas, Rupak

    2003-01-01

    This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.

  3. A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

    Energy Technology Data Exchange (ETDEWEB)

    Aliaga, José I., E-mail: aliaga@uji.es [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Alonso, Pedro [Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València (Spain); Badía, José M. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Chacón, Pablo [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Davidović, Davor [Rudjer Bošković Institute, Centar za Informatiku i Računarstvo – CIR, Zagreb (Croatia); López-Blanco, José R. [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Quintana-Ortí, Enrique S. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain)

    2016-03-15

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousand degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.

  4. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  5. QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors

    Science.gov (United States)

    Gudyś, Adam; Deorowicz, Sebastian

    2014-01-01

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435

  6. QuickProbs--a fast multiple sequence alignment algorithm designed for graphics processors.

    Directory of Open Access Journals (Sweden)

    Adam Gudyś

    Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

  7. TSC696E: a SPARC V8 processor with SEU protection built-in architecture

    Science.gov (United States)

    Corbiere, T.

    2002-12-01

    To meet the ever-increasing computing power required by airborne equipment, ESA has carried out the development of the LEON, a SPARC V8 processor. This device, available as a soft macro, targets both commercial and space markets. Therefore, to resist the radiation inherent to the space environment, its design is highly configurable so that SEU protection mechanisms can be implemented without significant penalty. The work described in this paper presents the design by Atmel of a 0.35 μm technology demonstrator implementing the Fault Tolerant version of the LEON, aimed at validating SPARC standard compliance and verifying SEU protection efficiency. Both functional validation and radiation results are reviewed, with special attention to actions that circumvent SEU-induced errors.

  8. Media processors using a new microsystem architecture designed for the Internet era

    Science.gov (United States)

    Wyland, David C.

    1999-12-01

    The demands of digital image processing, communications and multimedia applications are growing more rapidly than traditional design methods can fulfill them. Previously, only custom hardware designs could provide the performance required to meet the demands of these applications. However, hardware design has reached a crisis point. Hardware design can no longer deliver a product with the required performance and cost in a reasonable time for a reasonable risk. Software based designs running on conventional processors can deliver working designs in a reasonable time and with low risk but cannot meet the performance requirements. What is needed is a media processing approach that combines very high performance, a simple programming model, complete programmability, short time to market and scalability. The Universal Micro System (UMS) is a solution to these problems. The UMS is a completely programmable (including I/O) system on a chip that combines hardware performance with the fast time to market, low cost and low risk of software designs.

  9. Handling Multiple Ecologies in Architectural Design

    DEFF Research Database (Denmark)

    Lotz, Katrine; Sattrup, Peter Andreas

    2014-01-01

    if architects are to address the increasing demands for resource optimization and environmental performance with great precision, but the experiment also showed promises in resolving design problems with multifaceted solutions addressing social and environmental issues simultaneously. The methodological......, able to accommodate change over time is a necessary development in architectural conceptualization, what are the barriers and problems inherent in the present design culture, and how may these be overcome? In an educational experiment, architecture and architectural engineering students were asked...... to imagine that their recent housing projects had been built and occupied for 25+ years. The students were given the task to transform each other's projects according to new social programs, increased urban density and strict energy and resources use paradigms, using a design methodological framework...

  10. HyperForest: A high performance multi-processor architecture for real-time intelligent systems

    Energy Technology Data Exchange (ETDEWEB)

    Garcia, P. Jr.; Rebeil, J.P. [Sandia National Labs., Albuquerque, NM (United States); Pollard, H. [Univ. of New Mexico, Albuquerque, NM (United States). Electrical Engineering and Computer Engineering Dept.

    1997-04-01

    Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expandability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doors for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE's own production and disassembly activities.

  11. A High-Throughput, Adaptive FFT Architecture for FPGA-Based Space-Borne Data Processors

    Science.gov (United States)

    Nguyen, Kayla; Zheng, Jason; He, Yutao; Shah, Biren

    2010-01-01

    Historically, computationally-intensive data processing for space-borne instruments has heavily relied on ground-based computing resources. But with recent advances in functional densities of Field-Programmable Gate-Arrays (FPGAs), there has been an increasing desire to shift more processing on-board, thereby relaxing the downlink data bandwidth requirements. Fast Fourier Transforms (FFTs) are commonly used building blocks for data processing applications, with a growing need to increase the FFT block size. Many existing FFT architectures have mainly emphasized low power consumption or resource usage; but as the block size of the FFT grows, the throughput is often compromised first. In addition to power and resource constraints, space-borne digital systems are also limited to a small set of space-qualified memory elements, which typically lag behind the commercially available counterparts in capacity and bandwidth. The bandwidth limitation of the external memory creates a bottleneck for a high-throughput FFT design with large block size. In this paper, we present the Multi-Pass Wide Kernel FFT (MPWK-FFT) architecture for a moderately large block size (32K) with consideration of power consumption and resource usage, as well as throughput. We will also show that the architecture can be easily adapted for different FFT block sizes with different throughput and power requirements. The result is completely contained within an FPGA without relying on external memories. Implementation results are summarized.

  12. Design and Implementation of an Efficient Software Communications Architecture Core Framework for a Digital Signal Processors Platform

    Directory of Open Access Journals (Sweden)

    Wael A. Murtada

    2011-01-01

    Problem statement: The Software Communications Architecture (SCA) was developed to improve software reuse and interoperability in Software Defined Radios (SDR). However, there have been performance concerns since its conception. Arguably, the majority of the problems and inefficiencies associated with the SCA can be attributed to the assumption of modular distributed platforms relying on General Purpose Processors (GPPs) to perform all signal processing. Approach: Significant improvements in cost and power consumption can be obtained by utilizing specialized and more efficient platforms. Digital Signal Processors (DSPs) present such a platform and have been widely used in the communications industry. Improvements in development tools and middleware technology opened the possibility of fully integrating DSPs into the SCA. This approach takes advantage of the exceptional power, cost and performance characteristics of DSPs, while still enjoying the flexibility and portability of the SCA. Results: This study presents the design and implementation of an SCA Core Framework (CF) for a TI TMS320C6416 DSP. The framework is deployed on a C6416 Device Cycle Accurate Simulator and TI C6416 Development board. The SCA CF is implemented by leveraging OSSIE, an open-source implementation of the SCA, to support the DSP platform. OIS’s ORBExpress DSP and DSP/BIOS are used as the middleware and operating system, respectively. A sample waveform was developed to demonstrate the framework’s functionality. Benchmark results for the framework and sample applications are provided. Conclusion: Benchmark results show that using OIS ORBExpress DSP ORB middleware decreases the software memory footprint and increases system performance compared with PrismTech's e*ORB middleware.

  13. An Efficient Architecture Design of Reconfigurable Float-point FFT Processor

    Institute of Scientific and Technical Information of China (English)

    桑红石; 高伟

    2012-01-01

    Large resource cost is the design bottleneck of high-precision floating-point FFT processors. To overcome it, a novel reconfigurable pipelined R2/2^2SDF architecture is proposed that uses a shared butterfly structure built on single-port-memory FIFOs. The radix-2/2^2 algorithm and pipeline architecture, which suit floating-point design, ease the implementation of the reconfigurable circuit, reduce the number of complex multiplications, and improve the efficiency of the complex multipliers. Implementing the FIFO memory with double-width single-port RAM avoids the larger area and power cost of dual-port RAM. The proposed shared-butterfly architecture merges two butterfly stages, which solves the low butterfly utilization of the traditional single-path delay feedback architecture. Compared with traditional pipelined FFT processor designs, the resource cost of the floating-point design is effectively reduced and the utilization of the computing units is improved.

  14. High-Throughput, Adaptive FFT Architecture for FPGA-Based Spaceborne Data Processors

    Science.gov (United States)

    NguyenKobayashi, Kayla; Zheng, Jason X.; He, Yutao; Shah, Biren N.

    2011-01-01

    Exponential growth in microelectronics technology such as field-programmable gate arrays (FPGAs) has enabled high-performance spaceborne instruments with increasing onboard data processing capabilities. As a commonly used digital signal processing (DSP) building block, the fast Fourier transform (FFT) has been of great interest in onboard data processing applications, which need to strike a reasonable balance between high performance (throughput, block size, etc.) and low resource usage (power, silicon footprint, etc.). It is also desirable that a single design can be reused and adapted into instruments with different requirements. The Multi-Pass Wide Kernel FFT (MPWK-FFT) architecture was developed, in which the high-throughput benefits of the parallel FFT structure and the low resource usage of Singleton's single butterfly method are exploited. The result is a wide-kernel, multipass, adaptive FFT architecture. The 32K-point MPWK-FFT architecture includes 32 radix-2 butterflies, 64 FIFOs to store the real inputs, 64 FIFOs to store the imaginary inputs, complex twiddle factor storage, and FIFO logic to route the outputs to the correct FIFO. The inputs are stored in sequential fashion into the FIFOs, and the outputs of each butterfly are sequentially written first into the even FIFO, then the odd FIFO. Because of the order of the outputs written into the FIFOs, the depth of the even FIFOs, which is 768 each, is 1.5 times larger than that of the odd FIFOs, which is 512 each. The total memory needed for data storage, assuming that each sample is 36 bits, is 2.95 Mbits. The twiddle factors are stored in internal ROM inside the FPGA for fast access time. The total memory size to store the twiddle factors is 589.9 Kbits. This FFT structure combines the benefits of high throughput from the parallel FFT kernels and low resource usage from the multi-pass FFT kernels with desired adaptability. Space instrument missions that need onboard FFT capabilities such as the
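
    The data-storage figure quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the published design: it assumes that half of the 64 FIFOs in each bank (real and imaginary) are "even" FIFOs of depth 768 and the other half are "odd" FIFOs of depth 512, and that every FIFO word is 36 bits wide. The constant and function names are hypothetical.

```python
# Sanity check of the MPWK-FFT data-storage figure (2.95 Mbits).
# Assumptions (not stated explicitly in the abstract): each bank of 64 FIFOs
# is split evenly into even and odd FIFOs, and each FIFO word is 36 bits.

BITS_PER_WORD = 36      # sample width given in the abstract
FIFOS_PER_BANK = 64     # 64 FIFOs for real inputs, 64 for imaginary inputs
EVEN_DEPTH = 768        # depth of each even FIFO
ODD_DEPTH = 512         # depth of each odd FIFO


def data_storage_bits():
    """Total FIFO storage in bits for the real and imaginary banks."""
    even_fifos = FIFOS_PER_BANK // 2           # assumed even/odd split
    odd_fifos = FIFOS_PER_BANK - even_fifos
    words_per_bank = even_fifos * EVEN_DEPTH + odd_fifos * ODD_DEPTH
    return 2 * words_per_bank * BITS_PER_WORD  # two banks: real + imaginary


if __name__ == "__main__":
    bits = data_storage_bits()
    print(f"{bits} bits = {bits / 1e6:.2f} Mbits")
```

    Under these assumptions the total agrees with the 2.95 Mbits stated in the abstract when a megabit is taken as 10^6 bits.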

  15. Study and Design for a Java Processor Based on RISC Architecture

    Institute of Scientific and Technical Information of China (English)

    张金钟; 胡平

    2011-01-01

    This paper presents a new Java processor based on a RISC architecture that combines the advantages of classic Java processors such as PicoJava and JOP. It takes full advantage of Java instruction folding and of Reduced Instruction Set Computer (RISC) design, which both reduces design complexity and greatly improves Java processor performance.

  16. Architecture for Multiple Interacting Robot Intelligences

    Science.gov (United States)

    Peters, Richard Alan, II (Inventor)

    2008-01-01

    An architecture for robot intelligence enables a robot to learn new behaviors and create new behavior sequences autonomously and interact with a dynamically changing environment. Sensory information is mapped onto a Sensory Ego-Sphere (SES) that rapidly identifies important changes in the environment and functions much like short term memory. Behaviors are stored in a database associative memory (DBAM) that creates an active map from the robot's current state to a goal state and functions much like long term memory. A dream state converts recent activities stored in the SES and creates or modifies behaviors in the DBAM.

  17. Noise limitations in optical linear algebra processors.

    Science.gov (United States)

    Batsell, S G; Jong, T L; Walkup, J F; Krile, T F

    1990-05-10

    A general statistical noise model is presented for optical linear algebra processors. A statistical analysis which includes device noise, the multiplication process, and the addition operation is undertaken. We focus on those processes which are architecturally independent. Finally, experimental results which verify the analytical predictions are also presented.

  18. Spaceborne Processor Array

    Science.gov (United States)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  19. The Genetic Architecture of Multiple Myeloma

    Directory of Open Access Journals (Sweden)

    Steven M. Prideaux

    2014-01-01

    Multiple myeloma is a malignant proliferation of monoclonal plasma cells leading to clinical features that include hypercalcaemia, renal dysfunction, anaemia, and bone disease (frequently referred to by the acronym CRAB), which represent evidence of end organ failure. Recent evidence has revealed myeloma to be a highly heterogeneous disease composed of multiple molecularly-defined subtypes each with varying clinicopathological features and disease outcomes. The major division within myeloma is between hyperdiploid and nonhyperdiploid subtypes. In this division, hyperdiploid myeloma is characterised by trisomies of certain odd numbered chromosomes, namely, 3, 5, 7, 9, 11, 15, 19, and 21 whereas nonhyperdiploid myeloma is characterised by translocations of the immunoglobulin heavy chain alleles at chromosome 14q32 with various partner chromosomes, the most important of which are 4, 6, 11, 16, and 20. Hyperdiploid and nonhyperdiploid changes appear to represent early or even initiating mutagenic events that are subsequently followed by secondary aberrations including copy number abnormalities, additional translocations, mutations, and epigenetic modifications which lead to plasma cell immortalisation and disease progression. The following review provides a comprehensive coverage of the genetic and epigenetic events contributing to the initiation and progression of multiple myeloma and where possible these abnormalities have been linked to disease prognosis.

  20. Analyzing the trade-off between multiple memory controllers and memory channels on multi-core processor performance

    Energy Technology Data Exchange (ETDEWEB)

    Sancho Pitarch, Jose Carlos [Los Alamos National Laboratory; Kerbyson, Darren [Los Alamos National Laboratory; Lang, Mike [Los Alamos National Laboratory

    2010-01-01

    Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme when using two memory controllers, each with one channel, compared with one controller with two memory channels.

  1. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel’s first dual-core Itanium processor, code-named "Montecito" is a major release of Intel's Itanium 2 Processor Family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor. It has lower power consumption and thermal dissipation.

  2. Heavy-traffic analysis of a multiple-phase network with discriminatory processor sharing

    NARCIS (Netherlands)

    Verloop, I.M.; Ayesta, U.; Núñez-Queija, R.

    2011-01-01

    We analyze a generalization of the discriminatory processor-sharing (DPS) queue in a heavy-traffic setting. Customers present in the system are served simultaneously at rates controlled by a vector of weights. We assume that customers have phase-type distributed service requirements and allow that c
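
    The weighted sharing rule mentioned above can be illustrated with a short sketch. It follows the standard single-server DPS formulation, in which each class-i customer is served at rate w_i / sum_j (w_j * n_j), where n_j is the number of class-j customers present. The multiple-phase generalization studied in the paper is not modeled here, and the function name and `capacity` parameter are hypothetical.

```python
# Minimal sketch of per-customer service rates under discriminatory
# processor sharing (DPS). This is the textbook single-server rule, not the
# phase-type network analyzed in the record above.

def dps_rates(counts, weights, capacity=1.0):
    """Return the service rate received by one customer of each class.

    counts[i]  -- number of class-i customers currently in the system
    weights[i] -- DPS weight of class i
    capacity   -- total service capacity of the server (assumed 1.0)
    """
    total_weight = sum(n * w for n, w in zip(counts, weights))
    if total_weight == 0:
        # Empty system: nobody is being served.
        return [0.0] * len(counts)
    return [capacity * w / total_weight for w in weights]


if __name__ == "__main__":
    # Two class-0 customers with weight 1 and one class-1 customer with weight 2.
    print(dps_rates([2, 1], [1.0, 2.0]))
```

    With equal weights the rule reduces to ordinary egalitarian processor sharing, and the rates summed over all customers present always equal the server capacity.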

  3. Hardware Synchronization for Embedded Multi-Core Processors

    DEFF Research Database (Denmark)

    Stoif, Christian; Schoeberl, Martin; Liccardi, Benito

    2011-01-01

    Multi-core processors are about to conquer embedded systems; it is not a question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper, establi...

  4. S2 BHCA-Multiple AUVs cooperation oriented control architecture

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    Oceanographic surveys and similar tasks are natural applications for multiple AUVs. In this paper, the skill & simulation based hybrid control architecture (S2BHCA) is proposed as a design reference for the controller. It is a multi-robot cooperation oriented intelligent control architecture based on hybrid ideas. The S2BHCA attempts to incorporate the virtues of the reactive controller and of the deliberative controller by introducing the concept of the "skill". An additional online task simulation ability for cooperation is supported, too. As an application, a multiple AUV control system was developed with three "skills" for the MCM mission including two different cooperative tasks. The simulation and the sea trials show that simple task expression, fast reaction and better cooperation support can be achieved by realizing the AUV controller based on the S2BHCA.

  5. Multiple-Channel Security Architecture and its Implementation over SSL

    Directory of Open Access Journals (Sweden)

    Song Yong

    2006-01-01

    This paper presents multiple-channel SSL (MC-SSL), an architecture and protocol for protecting client-server communications. In contrast to SSL, which provides a single end-to-end secure channel, MC-SSL enables applications to employ multiple channels, each with its own cipher suite and data-flow direction. Our approach also allows for several partially trusted application proxies. The main advantages of MC-SSL over SSL are (a) support for end-to-end security in the presence of partially trusted proxies, and (b) selective data protection for achieving computational efficiency important to resource-constrained clients and heavily loaded servers.

  6. Software Defined Radio Digital Signal Processor Architecture Research

    Institute of Scientific and Technical Information of China (English)

    刘衡竹; 莫方政; 张波涛; 赵恒; 刘冬培; 陈艇; 周理

    2009-01-01

    Software defined radio (SDR) has attracted wide interest because it is regarded as the future direction of wireless communication development. The digital signal processor (DSP) is currently the bottleneck of software defined radio. This paper analyzes and compares several typical SDR digital signal processor architectures, summarizes the design rationale, advantages, and disadvantages of each, and discusses the development trends of SDR digital signal processors.

  7. Design concepts for a virtualizable embedded MPSoC architecture enabling virtualization in embedded multi-processor systems

    CERN Document Server

    Biedermann, Alexander

    2014-01-01

    Alexander Biedermann presents a generic hardware-based virtualization approach, which may transform an array of any off-the-shelf embedded processors into a multi-processor system with high execution dynamism. Based on this approach, he highlights concepts for the design of energy aware systems, self-healing systems as well as parallelized systems. For the latter, the novel so-called Agile Processing scheme is introduced by the author, which enables a seamless transition between sequential and parallel execution schemes. The design of such virtualizable systems is further aided by introduction

  8. CASPER: Embedding Power Estimation and Hardware-Controlled Power Management in a Cycle-Accurate Micro-Architecture Simulation Platform for Many-Core Multi-Threading Heterogeneous Processors

    Directory of Open Access Journals (Sweden)

    Arun Ravindran

    2012-02-01

    Despite the promising performance improvement observed in emerging many-core architectures in high performance processors, high power consumption prohibitively affects their use and marketability in the low-energy sectors, such as embedded processors, network processors and application specific instruction processors (ASIPs). While most chip architects design power-efficient processors by finding an optimal power-performance balance in their design, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequencies of idle cores and hence extend battery life and reduce operating costs. For large scale designs of many-core processors, a holistic approach integrating both these techniques at different levels of abstraction can potentially achieve maximal power savings. In this paper we present CASPER, a robust instruction trace driven cycle-accurate many-core multi-threading micro-architecture simulation platform where we have incorporated power estimation models of a wide variety of tunable many-core micro-architectural design parameters, thus enabling processor architects to explore a sufficiently large design space and achieve power-efficient designs. Additionally CASPER is designed to accommodate cycle-accurate models of hardware controlled power management units, enabling architects to experiment with and evaluate different autonomous power-saving mechanisms to study the run-time power-performance trade-offs in embedded many-core processors. We have implemented two such techniques in CASPER: Chipwide Dynamic Voltage and Frequency Scaling, and Performance Aware Core-Specific Frequency Scaling, which show average power savings of 35.9% and 26.2% on a baseline 4-core SPARC based architecture respectively. This power saving data accounts for the power consumption of the power management units themselves. The CASPER simulation platform also provides users with complete support of SPARCV9

  9. Revisiting Multiple Pattern Matching Algorithms for Multi-Core Architecture

    Institute of Scientific and Technical Information of China (English)

    Guang-Ming Tan; Ping Liu; Dong-Bo Bu; Yan-Bing Liu

    2011-01-01

    Due to the huge size of patterns to be searched, multiple pattern searching remains a challenge to several newly-arising applications like network intrusion detection. In this paper, we present an attempt to design efficient multiple pattern searching algorithms on multi-core architectures. We observe an important feature which indicates that the multiple pattern matching time mainly depends on the number and minimal length of patterns. The multi-core algorithm proposed in this paper leverages this feature to decompose the pattern set so that the parallel execution time is minimized. We formulate the problem as an optimal decomposition and scheduling of a pattern set, then propose a heuristic algorithm, which takes advantage of dynamic programming and greedy algorithmic techniques, to solve the optimization problem. Experimental results suggest that our decomposition approach can increase the searching speed by more than 200% on a 4-core AMD Barcelona system.

  10. Evaluating New Architectural Features Of The Intel(R) Xeon(R) 7500 Processor For Hpc Workloads

    OpenAIRE

    2011-01-01

    In this paper we take a look at what the Intel Xeon Processor 7500 family, code named Nehalem-EX, brings to high performance computing. We compare two families of Intel Xeon based systems (Intel Xeon 7500 and Intel Xeon 5600) and present a performance evolution of 16 node clusters based on these CPUs. We compare CPU generations utilizing dual socket platforms and a cluster across a number of HPC benchmarks and focused on different performance field and aspect. We will evaluate also technologies an...

  11. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...... on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. We demonstrate...

  12. ETHERNET PACKET PROCESSOR FOR SOC APPLICATION

    Directory of Open Access Journals (Sweden)

    Raja Jitendra Nayaka

    2012-07-01

    As the demand for Internet expands significantly in numbers of users, servers, IP addresses, switches and routers, the IP based network architecture must evolve and change. The design of domain specific processors that require high performance, low power and a high degree of programmability is the bottleneck in many processor based applications. This paper describes the design of an Ethernet packet processor for system-on-chip (SoC), which performs all core packet processing functions, including segmentation and reassembly, packetization, classification, and route and queue management, which will speed up switching/routing performance. Our design has been configured for use with multiple projects targeted to a commercial configurable logic device. The system is designed to support 10/100/1000 links with a speed advantage. VHDL has been used to implement and simulate the required functions in FPGA.

  13. A Shared Memory Module for Asynchronous Arrays of Processors

    Directory of Open Access Journals (Sweden)

    Zhiyi Yu

    2007-05-01

    A shared memory module connecting multiple independently clocked processors is presented. The memory module itself is independently clocked, supports hardware address generation, mutual exclusion, and multiple addressing modes. The architecture supports independent address generation and data generation/consumption by different processors which increases efficiency and simplifies programming for many embedded and DSP tasks. Simultaneous access by different processors is arbitrated using a least-recently-serviced priority scheme. Simulations show high throughputs over a variety of memory loads. A standard cell implementation shares an 8 K-word SRAM among four processors, and can support a 64 K-word SRAM with no additional changes. It cycles at 555 MHz and occupies 1.2 mm^2 in 0.18 μm CMOS.
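
    The least-recently-serviced arbitration mentioned above can be sketched in software. The model below is only a behavioral illustration under simple assumptions (one grant per arbitration cycle, requesters identified by small integers); it is not the hardware described in the paper, and the class and method names are hypothetical.

```python
# Behavioral sketch of a least-recently-serviced (LRS) arbiter.
# Assumption: at most one requester is granted per arbitration cycle.

class LeastRecentlyServicedArbiter:
    def __init__(self, num_requesters):
        # Priority order: index 0 holds the requester serviced longest ago.
        self.order = list(range(num_requesters))

    def grant(self, requests):
        """Grant the requesting processor that was serviced least recently.

        requests -- set of requester ids asserting a request this cycle
        Returns the granted id, or None if nobody is requesting.
        """
        for requester in self.order:
            if requester in requests:
                # Move the winner to the back: it is now most recently serviced.
                self.order.remove(requester)
                self.order.append(requester)
                return requester
        return None


if __name__ == "__main__":
    arbiter = LeastRecentlyServicedArbiter(4)
    for cycle_requests in ({1, 3}, {1, 3}, {0, 1, 3}, set()):
        print(sorted(cycle_requests), "->", arbiter.grant(cycle_requests))
```

    Because a granted requester always drops to the lowest priority, no processor that keeps requesting can be starved by the others, which is the usual motivation for this kind of scheme.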

  14. A Shared Memory Module for Asynchronous Arrays of Processors

    Directory of Open Access Journals (Sweden)

    Meeuwsen MichaelJ

    2007-01-01

    Full Text Available A shared memory module connecting multiple independently clocked processors is presented. The memory module itself is independently clocked, supports hardware address generation, mutual exclusion, and multiple addressing modes. The architecture supports independent address generation and data generation/consumption by different processors, which increases efficiency and simplifies programming for many embedded and DSP tasks. Simultaneous access by different processors is arbitrated using a least-recently-serviced priority scheme. Simulations show high throughputs over a variety of memory loads. A standard cell implementation shares an 8 K-word SRAM among four processors, and can support a 64 K-word SRAM with no additional changes. It cycles at 555 MHz and occupies 1.2 mm² in 0.18 μm CMOS.

  15. Reconfigurable Communication Processor:A New Approach for Network Processor

    Institute of Scientific and Technical Information of China (English)

    孙华; 陈青山; 张文渊

    2003-01-01

    As the traditional RISC + ASIC/ASSP approach to network processor design cannot meet today's requirements, this paper describes an alternative approach, Reconfigurable Processing Architecture, that boosts performance to ASIC level while preserving the programmability of a traditional RISC-based system. This paper covers both the hardware architecture and the software development environment architecture.

  16. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    Science.gov (United States)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the development of international wireless communication standards, the computational requirements for baseband signal processors are increasing. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Owing to their high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA approaches. In addition, large amounts of parallel data are processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architectures in SDR platforms. A new way must therefore be found to prototype SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in the machine description language LISA. LISA is a language for instruction set architectures that enables rapid modelling at the architectural level. In order to evaluate the proposed processor, three common baseband functions (FFT, FIR digital filter and matrix multiplication) have been mapped onto the SDR platform. Analytical results showed that the SDR processor achieved a maximum performance boost of 47.1% relative to the opponent processor.

  17. Evaluating New Architectural Features Of The Intel(R) Xeon(R) 7500 Processor For HPC Workloads

    Directory of Open Access Journals (Sweden)

    Paweł Gepner

    2011-01-01

    Full Text Available In this paper we take a look at what the Intel Xeon Processor 7500 family, code named Nehalem-EX, brings to high performance computing. We compare two families of Intel Xeon based systems (Intel Xeon 7500 and Intel Xeon 5600) and present a performance evolution of 16-node clusters based on these CPUs. We compare CPU generations utilizing dual socket platforms and a cluster across a number of HPC benchmarks, focusing on different performance fields and aspects. We also evaluate technologies and features such as Intel's Hyper-Threading Technology (HT) and Intel Turbo Boost Technology (Turbo Mode), and the performance implications of these technologies for HPC.

  18. Reducing the computational requirements for simulating tunnel fires by combining multiscale modelling and multiple processor calculation

    DEFF Research Database (Denmark)

    Vermesi, Izabella; Rein, Guillermo; Colella, Francesco

    2017-01-01

    directly. The feasibility analysis showed a difference of only 2% in temperature results from the published reference work that was performed with Ansys Fluent (Colella et al., 2010). The reduction in simulation time was significantly larger when using multiscale modelling than when performing multiple...

  19. A framework for general sparse matrix-matrix multiplication on GPUs and heterogeneous processors

    DEFF Research Database (Denmark)

    Liu, Weifeng; Vinter, Brian

    2015-01-01

    General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle...

  20. Optimal speech codec implementation on ARM9E (v5E architecture) RISC processor for next-generation mobile multimedia

    Science.gov (United States)

    Bangla, Ajay Kumar; Vinay, M. K.; Suresh Babu, P. V.

    2004-01-01

    The mobile phone is undergoing a rapid evolution from a voice and limited text-messaging device to a complete multimedia client. RISC processors are predominantly used in these devices due to low cost, time to market and power consumption. The growing demand for signal processing performance on these platforms has triggered a convergence of RISC, CISC and DSP technologies onto a single core/system. This convergence leads to a multitude of challenges for optimal usage of the available processing power. Voice codecs, which have traditionally been implemented on DSP platforms, have been adapted to pure RISC platforms as well. In this paper, the issues involved in optimizing a standard vocoder for a RISC-DSP convergence platform (DSP enhanced RISC platform) are addressed. Our optimization techniques are based on the identification of algorithms that could exploit either the DSP features or the RISC features or both. A few algorithmic modifications have also been suggested. By a systematic application of these optimization techniques to a GSM-AMR (NB) codec on the ARM9E core, we could achieve more than 77% improvement over the baseline codec and almost 33% over that optimized for a RISC platform (ARM9T) alone in terms of processing cycle requirements. The optimization techniques outlined are generic in nature and are applicable to other vocoders on similar "application-platform" combinations.

  1. Embedded Processor Oriented Compiler Infrastructure

    Directory of Open Access Journals (Sweden)

    DJUKIC, M.

    2014-08-01

    Full Text Available In recent years, research on special compiler techniques and algorithms for embedded processors has broadened the knowledge of how to achieve better compiler performance on irregular processor architectures. However, industrial strength compilers, besides the ability to generate efficient code, must also be robust, understandable, maintainable, and extensible. This raises the need for a compiler infrastructure that provides means for convenient implementation of embedded processor oriented compiler techniques. The Cirrus Logic Coyote 32 DSP is an example that shows how traditional compiler infrastructure is not able to cope with the problem. That is why a new compiler infrastructure was developed for this processor, based on research in the field of embedded system software tools and experience in the development of industrial strength compilers. The new infrastructure is described in this paper. The quality of compiler-generated code is compared with code generated by the previous compiler for the same processor architecture.

  2. A Mobile Service Oriented Multiple Object Tracking Augmented Reality Architecture for Education and Learning Experiences

    Science.gov (United States)

    Rattanarungrot, Sasithorn; White, Martin; Newbury, Paul

    2014-01-01

    This paper describes the design of our service-oriented architecture to support mobile multiple object tracking augmented reality applications applied to education and learning scenarios. The architecture is composed of a mobile multiple object tracking augmented reality client, a web service framework, and dynamic content providers. Tracking of…

  3. Green Building between Tradition and Modernity Study Comparative Analysis between Conventional Methods and Updated Styles of Design and Architecture Processors

    Directory of Open Access Journals (Sweden)

    H Elshimy

    2017-03-01

    Full Text Available The green house concept has appeared from ancient to modern ages, and there is a tendency to use traditional architecture in areas with a pristine ecological environment, moving through sophisticated systems to modern, upgraded systems whose architectural treatments achieve environmental sustainability. In recent years, the sustainability concept has become the common interest of numerous disciplines. The reason for this popularity is the aim of achieving sustainable development. The concept of green architecture, also known as "sustainable architecture" or "green house," is the theory, science and style of buildings designed and constructed in accordance with environmentally friendly principles. Green house strives to minimize the number of resources consumed in the building's construction, use and operation, as well as curtailing the harm done to the environment through the emission, pollution and waste of its components. To design, construct, operate and maintain buildings, energy, water and new materials are utilized, and amounts of waste that cause negative effects on health and the environment are generated. In order to limit these effects and design environmentally sound and resource efficient buildings, "green building systems" must be introduced, clarified, understood and practiced. This paper aims at highlighting these difficult and complex issues of sustainability, which encompass the scope of almost every aspect of human life.

  4. Tiled Multicore Processors

    Science.gov (United States)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  5. Secure Co-processor and Billboard Manager Based Architecture Help to Protect & Store the Citrix Xenserver Based Virtual Data.

    Directory of Open Access Journals (Sweden)

    Debabrata Sarddar

    2014-01-01

    Full Text Available Any discussion of cloud computing typically begins with virtualization. Virtualization is critical to cloud computing because it simplifies the delivery of services by providing a platform for optimizing complex IT resources in a scalable manner, which is what makes cloud computing so cost effective. Desktop virtualization, often called client virtualization, is a virtualization technology used to separate a computer desktop environment from the physical computer. Desktop virtualization is considered a type of client-server computing model because the "virtualized" desktop is stored on a centralized, or remote, server and not on the physical machine being virtualized. Desktop virtualization "virtualizes desktop computers" and these virtual desktop environments are "served" to users on the network. In this paper, we propose a secure cloud data center architecture built on an application virtualization product such as Citrix XenApp/Citrix XenDesktop, together with a proposed model that helps to encrypt and store data such as virtualized desktops or virtualized applications in a suitable storage area.

  6. Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

    CERN Document Server

    Alipour, Mehdi

    2012-01-01

    Given the increasing complexity of network applications and internet traffic, network processors, as a subset of embedded processors, have to process more computation-intensive tasks. By scaling down the feature size and with the emergence of chip multiprocessors (CMP), which are usually multithread processors, the performance requirements are somewhat guaranteed. As multithread processors are the heirs of uni-thread processors and there is no general design flow for designing a multithread embedded processor, in this paper we perform a comprehensive design space exploration for an optimum uni-thread embedded processor based on limited area and power budgets. Finally, we run multiple threads on this architecture to find the maximum thread level parallelism (TLP) based on the performance per power and area of the optimum uni-thread architecture.

  7. Monte Carlo simulations on SIMD computer architectures

    Energy Technology Data Exchange (ETDEWEB)

    Burmester, C.P.; Gronsky, R. [Lawrence Berkeley Lab., CA (United States)]; Wille, L.T. [Florida Atlantic Univ., Boca Raton, FL (United States). Dept. of Physics]

    1992-03-01

    Algorithmic considerations regarding the implementation of various materials science applications of the Monte Carlo technique on single instruction multiple data (SIMD) computer architectures are presented. In particular, implementation of the Ising model with nearest, next nearest, and long range screened Coulomb interactions on the SIMD architecture MasPar MP-1 (DEC mpp-12000) series of massively parallel computers is demonstrated. Methods of code development which optimize processor array use and minimize inter-processor communication are presented, including lattice partitioning and the use of processor array spanning tree structures for data reduction. Both geometric and algorithmic parallel approaches are utilized. Benchmarks in terms of Monte Carlo updates per second for the MasPar architecture are presented and compared to values reported in the literature from comparable studies on other architectures.
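
    One common form of lattice partitioning for the nearest-neighbour Ising model is the checkerboard (two-colour) decomposition, sketched below. This is an illustrative assumption, not necessarily the partitioning used by the authors: the function name, the coupling J = 1 and the zero external field are all hypothetical choices, and the loops are sequential Python standing in for lock-step SIMD updates.

```python
import math
import random


def checkerboard_sweep(spins, beta, rng):
    """One Metropolis sweep of a periodic 2D Ising lattice (J = 1, no field).

    Sites are updated in two half-sweeps by colour: (i + j) even, then odd.
    Sites of one colour share no nearest neighbours (for an even lattice
    size), so a SIMD machine could update a whole colour in lock-step.
    """
    n = len(spins)  # assumed square with even side length
    for colour in (0, 1):
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                neighbours = (spins[(i - 1) % n][j] + spins[(i + 1) % n][j]
                              + spins[i][(j - 1) % n] + spins[i][(j + 1) % n])
                # Energy change if this spin were flipped.
                delta_e = 2 * spins[i][j] * neighbours
                if delta_e <= 0 or rng.random() < math.exp(-beta * delta_e):
                    spins[i][j] = -spins[i][j]
    return spins


lattice = [[1] * 4 for _ in range(4)]
checkerboard_sweep(lattice, beta=0.4, rng=random.Random(1))
```

    The next nearest and long range interactions named in the abstract would need more colours or a different partitioning, which this sketch does not attempt.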

  8. A Free Market Architecture for Coordinating Multiple Robots

    Science.gov (United States)

    1999-12-01

    L. E., “ALLIANCE: An Architecture for Fault Tolerant Multi-Robot Cooperation”, IEEE Transactions on Robotics and Automation, Vol. 14, No.2, pp. 220...Automation, pp. 582-587, 1993. 18. Schneider-Fontán, M., Matarić, M. J., “Territorial Multi-Robot Task Division”, IEEE Transactions on Robotics and

  9. Making CSB+-Trees Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  10. High Performance Ethernet Packet Processor Core for Next Generation Networks

    Directory of Open Access Journals (Sweden)

    Raja Jitendra Nayaka

    2012-10-01

    Full Text Available As the demand for high speed Internet is significantly increasing to meet the requirements of large data transfers, real-time communication and High Definition (HD) multimedia transfer over IP, IP-based network product architecture must evolve and change. Application-specific processors require high performance, low power and a high degree of programmability, which is the limitation in many general-processor-based applications. This paper describes the design of an Ethernet packet processor for system-on-chip (SoC) that performs all core packet processing functions, including segmentation and reassembly, packetization classification, route and queue management, which will speed up switching/routing performance, making it more suitable for Next Generation Networks (NGN). The Ethernet packet processor design can be configured for use with multiple projects targeted to an FPGA device; the system is designed to support 1/10/20/40/100 Gigabit links with a speed and performance advantage. VHDL has been used to implement and simulate the required functions in FPGA.

  11. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...

  12. A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units

    OpenAIRE

    Kreutzer, Moritz; Hager, Georg; Wellein, Gerhard; Fehske, Holger; Bishop, Alan R.

    2013-01-01

    Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most effi...

  13. Hardware multiplier processor

    Science.gov (United States)

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

  14. An architecture model for multiple disease management information systems.

    Science.gov (United States)

    Chen, Lichin; Yu, Hui-Chu; Li, Hao-Chun; Wang, Yi-Van; Chen, Huang-Jen; Wang, I-Ching; Wang, Chiou-Shiang; Peng, Hui-Yu; Hsu, Yu-Ling; Chen, Chi-Huang; Chuang, Lee-Ming; Lee, Hung-Chang; Chung, Yufang; Lai, Feipei

    2013-04-01

    Disease management is a program that attempts to overcome the fragmentation of the healthcare system and improve the quality of care. Many studies have proven the effectiveness of disease management. However, case managers spend the majority of their time on documentation and on coordinating the members of the care team. They need a tool to support their daily practice and to optimize the inefficient workflow. Several discussions have indicated that information technology plays an important role in the era of disease management. Although applications have been developed, it is inefficient to develop an information system for each disease management program individually. The aim of this research is to support the work of disease management, reform the inefficient workflow, and propose an architecture model that enhances the reusability and time savings of information system development. The proposed architecture model has been successfully implemented in two disease management information systems, and the result was evaluated through reusability analysis, time consumed analysis, pre- and post-implementation workflow analysis, and a user questionnaire survey. The reusability of the proposed model was high, less than half of the time was consumed, and the workflow was improved. The overall user response is positive. The supportiveness during daily workflow is high. The system empowers case managers with better information and leads to better decision making.

  15. Functional Verification of Enhanced RISC Processor

    OpenAIRE

    SHANKER NILANGI; SOWMYA L

    2013-01-01

    This paper presents the design and verification of a 32-bit enhanced RISC processor core with floating point computation integrated within the core, designed to reduce cost and complexity. The designed 3-stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture, with a single precision floating point multiplier, a floating point adder/subtractor for floating point operations and a 32 x 32 Booth's multiplier added to the integer core of ARM7. The binary representati...

  16. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...... through the use of micro-benchmarks that our principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration....

  17. Parallel k-means++ for Multiple Shared-Memory Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
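
    For reference, the exact k-means++ seeding that the paper parallelizes can be sketched sequentially as follows. The function names and the list-of-tuples data layout are assumptions made for illustration; the authors' multicore, GPU and Cray XMT implementations are not reproduced here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Choose k initial centres by exact k-means++ (D^2-weighted) sampling."""
    centres = [rng.choice(points)]
    # Squared distance from each point to its nearest chosen centre.
    d2 = [sq_dist(p, centres[0]) for p in points]
    while len(centres) < k:
        total = sum(d2)  # a reduction over all points
        if total == 0:
            # Every point coincides with a centre already; pick arbitrarily.
            centres.append(rng.choice(points))
            continue
        threshold = rng.random() * total
        running = 0.0
        chosen = len(points) - 1  # fallback in case of floating-point round-off
        for idx, weight in enumerate(d2):
            running += weight
            if running > threshold:
                chosen = idx
                break
        centres.append(points[chosen])
        # Independent per-point update: the naturally data-parallel step.
        d2 = [min(old, sq_dist(p, points[chosen])) for old, p in zip(d2, points)]
    return centres


seeds = kmeanspp_seeds([(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)], 2, random.Random(3))
```

    The distance update and the weighted selection are the two steps that touch every point on each iteration, which is presumably where a shared-memory implementation would spend its effort.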

  18. Fuzzy Motivations in a Multiple Agent Behaviour-Based Architecture

    Directory of Open Access Journals (Sweden)

    Tomás V. Arredondo

    2013-08-01

    Full Text Available In this article we introduce a blackboard-based multiple agent system framework that considers biologically-based motivations as a means to develop a user friendly interface. The framework includes a population-based heuristic as well as a fuzzy logic-based inference system used for scoring system behaviours. The heuristic provides an optimization environment and the fuzzy scoring mechanism is used to give a fitness score to possible system outputs (i.e. solutions). This framework results in the generation of complex behaviours which respond to previously specified motivations. Our multiple agent blackboard and motivation-based framework is validated in a low cost mobile robot specifically built for this task. The robot was used in several navigation experiments and the motivation profile that was considered included "curiosity", "homing", "energy" and "missions". Our results show that this motivation-based approach permits a low cost multiple agent-based autonomous mobile robot to acquire a diverse set of fit behaviours that respond well to user and performance expectations. These results also validate our multiple agent framework as an incremental, flexible and practical method for the development of robust multiple agent systems.

  19. Ultrafast Fourier-transform parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  20. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of TriMedia processors, and describe the innovations with respect to its predecessor: the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural

  1. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

    In this thesis, we present the TM3270 VLIW media-processor, the latest of TriMedia processors, and describe the innovations with respect to its predecessor: the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements

  2. Multithreading architecture

    CERN Document Server

    Nemirovsky, Mario

    2013-01-01

    Multithreaded architectures now appear across the entire range of computing devices, from the highest-performing general purpose devices to low-end embedded processors. Multithreading enables a processor core to more effectively utilize its computational resources, as a stall in one thread need not cause execution resources to be idle. This enables the computer architect to maximize performance within area constraints, power constraints, or energy constraints. However, the architectural options for the processor designer or architect looking to implement multithreading are quite extensive and

  3. Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems

    Directory of Open Access Journals (Sweden)

    Gastaldo P

    2008-01-01

    Full Text Available Montgomery's algorithm is a popular technique to speed up modular multiplications in public-key cryptosystems. This paper tackles the efficient support of modular exponentiation on inexpensive circuitry for embedded security services and proposes a variant of the finely integrated product scanning (FIPS) algorithm that is targeted to digital signal processors. The general approach improves on the basic FIPS formulation by removing potential inefficiencies and boosts the exploitation of computing resources. The reformulation of the basic FIPS structure results in a general approach that balances computational efficiency and flexibility. Experimental results on commercial DSP platforms confirm both the method's validity and its effectiveness.
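
    To make the underlying idea concrete, the sketch below shows textbook Montgomery reduction and a modular exponentiation built on it. It uses Python's arbitrary-precision integers, so it is not the word-level FIPS variant the paper proposes; all function names are hypothetical, and `pow(n, -1, r)` assumes Python 3.8 or later.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n < R, where R = 2**k."""
    r = 1 << k
    assert n % 2 == 1 and n < r
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime = -1 (mod R)
    r2 = (r * r) % n                # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R^-1 mod n for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k            # exact: t + m*n is divisible by R
    return u - n if u >= n else u


def mont_mod_exp(base, exponent, n, k):
    """Compute base**exponent mod n with right-to-left binary exponentiation."""
    n_prime, r2 = montgomery_setup(n, k)
    x_bar = redc((base % n) * r2, n, k, n_prime)  # base * R mod n
    acc = redc(r2, n, k, n_prime)                 # 1 * R mod n
    while exponent > 0:
        if exponent & 1:
            acc = redc(acc * x_bar, n, k, n_prime)
        x_bar = redc(x_bar * x_bar, n, k, n_prime)
        exponent >>= 1
    return redc(acc, n, k, n_prime)               # leave Montgomery form


result = mont_mod_exp(7, 65537, 1000003, 20)
```

    The benefit is that each reduction needs only multiplications, a mask and a shift by k bits, with no division by n. A product-scanning implementation applies the same idea one machine word at a time.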

  4. Multiple Estimation Architecture in Discrete-Time Adaptive Mixing Control

    Directory of Open Access Journals (Sweden)

    Simone Baldi

    2013-05-01

    Full Text Available Adaptive mixing control (AMC) is a recently developed control scheme for uncertain plants, where the control action coming from a bank of precomputed controllers is mixed based on the parameter estimates generated by an on-line parameter estimator. Even if the stability of the control scheme, also in the presence of modeling errors and disturbances, has been shown analytically, its transient performance might be sensitive to the initial conditions of the parameter estimator. In particular, for some initial conditions, transient oscillations may not be acceptable in practical applications. In order to account for such a possible phenomenon and to improve the learning capability of the adaptive scheme, in this paper a new mixing architecture is developed, involving the use of parallel parameter estimators, or multi-estimators, each one working on a small subset of the uncertainty set. A supervisory logic, using performance signals based on the past and present estimation error, selects the parameter estimate to determine the mixing of the controllers. The stability and robustness properties of the resulting approach, referred to as multi-estimator adaptive mixing control (Multi-AMC), are analytically established. In addition, extensive simulations demonstrate that the scheme improves the transient performance of the original AMC with a single estimator. The control scheme and the analysis are carried out in a discrete-time framework, for easier implementation of the method in digital control.

  5. First Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Andreichenko, V. B.; Dotsenko S., Vi.; Shchur, L. N.

    We describe the architecture of a special purpose processor built to realize the Wolff cluster algorithm in hardware; this algorithm is not hampered by critical slowing down. The processor simulates two-dimensional Ising-like spin systems. With minor changes, the same very effective architecture, which can be defined as a Memory Machine, can be used to study phase transitions in a wide range of models in two or three dimensions.
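
    For readers unfamiliar with the Wolff algorithm, a minimal software sketch of one cluster update for the 2D Ising model follows. It assumes coupling J = 1, periodic boundaries and a hypothetical function name, and it says nothing about how the described hardware organizes the same computation.

```python
import math
import random


def wolff_step(spins, beta, rng):
    """Grow and flip one Wolff cluster on a periodic square lattice; return its size."""
    n = len(spins)
    p_add = 1.0 - math.exp(-2.0 * beta)  # bond probability for J = 1
    i, j = rng.randrange(n), rng.randrange(n)
    seed_spin = spins[i][j]
    spins[i][j] = -seed_spin             # flipping also marks the site as visited
    stack = [(i, j)]
    size = 1
    while stack:
        x, y = stack.pop()
        for nx, ny in (((x + 1) % n, y), ((x - 1) % n, y),
                       (x, (y + 1) % n), (x, (y - 1) % n)):
            # Only neighbours still aligned with the seed spin can join.
            if spins[nx][ny] == seed_spin and rng.random() < p_add:
                spins[nx][ny] = -seed_spin
                stack.append((nx, ny))
                size += 1
    return size


lattice = [[1] * 8 for _ in range(8)]
cluster_size = wolff_step(lattice, beta=0.44, rng=random.Random(5))
```

    Because whole correlated clusters flip at once, successive configurations decorrelate quickly near the critical temperature, which is the sense in which the algorithm avoids critical slowing down.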

  6. Multiplication of sparse Laurent polynomials and Poisson series on modern hardware architectures

    CERN Document Server

    Biscani, Francesco

    2010-01-01

    In this paper we present two algorithms for the multiplication of sparse Laurent polynomials and Poisson series (the latter being algebraic structures commonly arising in Celestial Mechanics from the application of perturbation theories). Both algorithms first employ the Kronecker substitution technique to reduce multivariate multiplication to univariate multiplication, and then use the schoolbook method to perform the univariate multiplication. The first algorithm, suitable for moderately-sparse multiplication, uses the exponents of the monomials resulting from the univariate multiplication as trivial hash values in a one dimensional lookup array of coefficients. The second algorithm, suitable for highly-sparse multiplication, uses a cache-optimised hash table which stores the coefficient-exponent pairs resulting from the multiplication using the exponents as keys. Both algorithms have been implemented with attention to modern computer hardware architectures. Particular care has been devoted to the efficient...
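
    The combination of Kronecker substitution and schoolbook multiplication described above can be sketched as follows. This is a simplified illustration with assumed names: it handles only non-negative exponents (true Laurent polynomials would first need an exponent shift), and a plain Python dict stands in for the paper's cache-optimised hash table.

```python
def kronecker_multiply(p, q, num_vars):
    """Multiply sparse polynomials stored as {exponent_tuple: coefficient}."""
    # Per-variable bound: largest exponent possible in the product, plus one.
    bounds = []
    for v in range(num_vars):
        max_p = max((e[v] for e in p), default=0)
        max_q = max((e[v] for e in q), default=0)
        bounds.append(max_p + max_q + 1)

    def encode(exps):
        # Mixed-radix packing of an exponent vector into a single integer.
        code, stride = 0, 1
        for e, b in zip(exps, bounds):
            code += e * stride
            stride *= b
        return code

    def decode(code):
        exps = []
        for b in bounds:
            code, e = divmod(code, b)
            exps.append(e)
        return tuple(exps)

    up = {encode(e): c for e, c in p.items()}
    uq = {encode(e): c for e, c in q.items()}

    # Schoolbook univariate multiplication: adding codes adds exponent vectors,
    # and the bounds guarantee that no digit overflows into its neighbour.
    product = {}
    for ea, ca in up.items():
        for eb, cb in uq.items():
            key = ea + eb
            product[key] = product.get(key, 0) + ca * cb

    return {decode(code): c for code, c in product.items() if c != 0}


# (x + y) * (x - y) = x^2 - y^2, with exponent tuples ordered as (x, y).
difference_of_squares = kronecker_multiply({(1, 0): 1, (0, 1): 1},
                                           {(1, 0): 1, (0, 1): -1}, 2)
```

    Replacing the dict with a dense array indexed by the integer code would correspond to the first, moderately sparse algorithm in the abstract.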

  7. Dynamic information architecture system (DIAS) : multiple model simulation management.

    Energy Technology Data Exchange (ETDEWEB)

    Simunich, K. L.; Sydelko, P.; Dolph, J.; Christiansen, J.

    2002-05-13

    Dynamic Information Architecture System (DIAS) is a flexible, extensible, object-based framework for developing and maintaining complex multidisciplinary simulations of a wide variety of application contexts. The modeling domain of a specific DIAS-based simulation is determined by (1) software Entity (domain-specific) objects that represent the real-world entities that comprise the problem space (atmosphere, watershed, human), and (2) simulation models and other data processing applications that express the dynamic behaviors of the domain entities. In DIAS, models communicate only with Entity objects, never with each other. Each Entity object has a number of Parameter and Aspect (of behavior) objects associated with it. The Parameter objects contain the state properties of the Entity object. The Aspect objects represent the behaviors of the Entity object and how it interacts with other objects. DIAS extends the "Object" paradigm by abstraction of the object's dynamic behaviors, separating the "WHAT" from the "HOW." DIAS object class definitions contain an abstract description of the various aspects of the object's behavior (the WHAT), but no implementation details (the HOW). Separate DIAS models/applications carry the implementation of object behaviors (the HOW). Any model deemed appropriate, including existing legacy-type models written in other languages, can drive entity object behavior. The DIAS design promotes plug-and-play of alternative models, with minimal recoding of existing applications. The DIAS Context Builder object builds a construct or scenario for the simulation, based on developer specification and user inputs. Because DIAS is a discrete event simulation system, there is a Simulation Manager object with which all events are processed. Any class that registers to receive events must implement an event handler (method) to process the event during execution. Event handlers

  8. The TMS34010 graphic processor - an architecture for image visualization in NMR tomography; O processador grafico TMS34010 - uma arquitetura para visualizacao de imagem em tomografia por RMN

    Energy Technology Data Exchange (ETDEWEB)

    Slaets, Jan Frans Willem; Paiva, Maria Stela Veludo de; Almeida, Lirio O.B

    1989-12-31

    This abstract presents a description of the minimum system implemented with the graphic processor TMS34010, which will be used in the reconstruction, treatment and interpretation of images obtained by NMR tomography. The project is being developed in the LIE (Electronic Instrumentation Laboratory) of the Sao Carlos Chemistry and Physical Institute, SP, Brazil, and is already in operation. 4 refs., 7 figs.

  9. Preliminary design of an advanced programmable digital filter network for large passive acoustic ASW systems. [Parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    McWilliams, T.; Widdoes, Jr., L. C.; Wood, L.

    1976-09-30

    The design of an extremely high performance programmable digital filter of novel architecture, the LLL Programmable Digital Filter, is described. The digital filter is a high-performance multiprocessor having general purpose applicability and high programmability; it is extremely cost effective in either a uniprocessor or a multiprocessor configuration. The architecture and instruction set of the individual processor were optimized with regard to the multiple processor configuration. The optimal structure of a parallel processing system was determined for addressing the specific Navy application centering on the advanced digital filtering of passive acoustic ASW data of the type obtained from the SOSUS net. 148 figures. (RWR)

  10. Using Multiple FPGA Architectures for Real-time Processing of Low-level Machine Vision Functions

    Science.gov (United States)

    Thomas H. Drayer; William E. King; Philip A. Araman; Joseph G. Tront; Richard W. Conners

    1995-01-01

    In this paper, we investigate the use of multiple Field Programmable Gate Array (FPGA) architectures for real-time machine vision processing. The use of FPGAs for low-level processing represents an excellent tradeoff between software and special purpose hardware implementations. A library of modules that implement common low-level machine vision operations is presented...

  11. A high-accuracy optical linear algebra processor for finite element applications

    Science.gov (United States)

    Casasent, D.; Taylor, B. K.

    1984-01-01

    Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuracy to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables), which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantifies the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
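
    The idea of multiplication by digital convolution mentioned in the abstract above can be illustrated in software. This is a generic sketch of the arithmetic only, assuming non-negative integers and a binary encoding; it does not model the optical hardware, and the function names are hypothetical.

```python
def to_digits(value, base=2):
    """Return the digits of a non-negative integer, least significant first."""
    digits = []
    while True:
        value, digit = divmod(value, base)
        digits.append(digit)
        if value == 0:
            break
    return digits


def convolve(a, b):
    """Discrete linear convolution of two digit sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def multiply_by_convolution(x, y, base=2):
    """Multiply two integers by convolving their digit sequences."""
    mixed = convolve(to_digits(x, base), to_digits(y, base))
    # Entries of `mixed` may exceed base - 1 (no carries were propagated),
    # so each entry is weighted by its positional value to get the product.
    return sum(v * base ** k for k, v in enumerate(mixed))


digits_13_times_11 = convolve(to_digits(13), to_digits(11))
```

    Here `multiply_by_convolution(13, 11)` returns 143. Each operand digit is only 0 or 1, so the individual products stay small, which is the property that relaxes the dynamic range needed from an analog multiplier.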

  12. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays. It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding the global memory accesses that GPUs rely on, and may even support loops with complex dependencies, such as loop-carried dependencies, that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  13. Breadboard Signal Processor for Arraying DSN Antennas

    Science.gov (United States)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a subsequent version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver
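
    The FX operation defined in the abstract above (transform, multiply, accumulate) can be sketched for a single antenna pair. This is an illustrative sketch only: it uses a naive discrete Fourier transform instead of a fast one, takes the complex conjugate of the second spectrum as is usual for cross-correlation, and the names `dft`, `fx_correlate`, and `tone` are hypothetical.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (stands in for the FFT, the 'F' step)."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, s in enumerate(samples))
            for k in range(n)]


def fx_correlate(a, b, nchan):
    """Accumulate the per-channel cross-spectrum of two sample streams."""
    acc = [0j] * nchan
    for start in range(0, len(a) - nchan + 1, nchan):
        fa = dft(a[start:start + nchan])
        fb = dft(b[start:start + nchan])
        for k in range(nchan):
            acc[k] += fa[k] * fb[k].conjugate()  # the 'X' step, then accumulate
    return acc


def tone(length, channel, nchan, delay=0):
    """Test signal: a cosine centred on one frequency channel."""
    return [math.cos(2 * math.pi * channel * (t - delay) / nchan)
            for t in range(length)]


# Antenna B sees the same tone as antenna A, one sample later.
spectrum = fx_correlate(tone(32, 2, 8), tone(32, 2, 8, delay=1), nchan=8)
phase_offset = cmath.phase(spectrum[2])
```

    The phase of each accumulated channel is the kind of quantity from which phase and delay calibration can be derived: in this example the one-sample delay appears as a quarter-cycle phase offset in channel 2.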

  14. Keystone Business Models for Network Security Processors

    Directory of Open Access Journals (Sweden)

    Arthur Low

    2013-07-01

    Network security processors are critical components of high-performance systems built for cybersecurity. Development of a network security processor requires multi-domain experience in semiconductors and complex software security applications, and multiple iterations of both software and hardware implementations. Limited by the business models in use today, such an arduous task can be undertaken only by large incumbent companies and government organizations. Neither the “fabless semiconductor” models nor the silicon intellectual-property licensing (“IP-licensing”) models allow small technology companies to successfully compete. This article describes an alternative approach that produces an ongoing stream of novel network security processors for niche markets through continuous innovation by both large and small companies. This approach, referred to here as the "business ecosystem model for network security processors", includes a flexible and reconfigurable technology platform, a “keystone” business model for the company that maintains the platform architecture, and an extended ecosystem of companies that both contribute and share in the value created by innovation. New opportunities for business model innovation by participating companies are made possible by the ecosystem model. This ecosystem model builds on: (i) the lessons learned from the experience of the first author as a senior integrated circuit architect for providers of public-key cryptography solutions and as the owner of a semiconductor startup, and (ii) the latest scholarly research on technology entrepreneurship, business models, platforms, and business ecosystems. This article will be of interest to all technology entrepreneurs, but it will be of particular interest to owners of small companies that provide security solutions and to specialized security professionals seeking to launch their own companies.

  15. Communications systems and methods for subsea processors

    Science.gov (United States)

    Gutierrez, Jose; Pereira, Luis

    2016-04-26

    A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
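
    The time division multiple access (TDMA) scheme mentioned in the abstract above gives each device on the shared bus a recurring time slot in which it alone may transmit. A minimal sketch of slot ownership follows; the node names, slot length, and the function `tdma_owner` are hypothetical and are not taken from the patent.

```python
def tdma_owner(time_ms, nodes, slot_ms):
    """Return which node owns the shared bus at a given time.

    The frame repeats every len(nodes) * slot_ms milliseconds, and each
    node owns exactly one slot per frame, in list order.
    """
    frame_ms = slot_ms * len(nodes)
    return nodes[(time_ms % frame_ms) // slot_ms]


# Hypothetical bus participants and a 10 ms slot length.
NODES = ["subsea_processor", "bop_sensor_hub", "surface_link"]
owner_at_15_ms = tdma_owner(15, NODES, slot_ms=10)
```

    Because ownership depends only on the shared clock, no arbitration messages are needed, but every node must keep its clock synchronized with the others.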

  16. Combined Integer and Variable Precision (CIVP) Floating Point Multiplication Architecture for FPGAs

    CERN Document Server

    Thapliyal, Himanshu; Bajpai, Rajnish; Sharma, Kamal K

    2007-01-01

    In this paper, we propose an architecture/methodology for making FPGAs suitable for integer as well as variable precision floating point multiplication. The proposed work will be of great importance in applications that require variable precision floating point multiplication, such as multimedia processing applications. In the proposed architecture/methodology, we propose the replacement of the existing 18x18 bit and 25x18 bit dedicated multipliers in FPGAs with dedicated 24x24 bit and 24x9 bit multipliers, respectively. We have proved that our approach of providing the dedicated 24x24 bit and 24x9 bit multipliers in FPGAs will make them efficient for performing integer as well as single precision, double precision, and quadruple precision floating point multiplications.
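
    One reason a 24-bit multiplier width suits floating point is that an IEEE 754 single precision significand is 24 bits wide (including the implicit leading one), so it fits a single 24x24 multiplier, while wider significands can be built from several partial products. The sketch below shows that generic decomposition using 24-bit pieces only. It is an assumption made for illustration: the abstract does not state how the paper combines the 24x24 and 24x9 blocks, and the function names are hypothetical.

```python
LIMB_BITS = 24
LIMB_MASK = (1 << LIMB_BITS) - 1


def split_limbs(value, width):
    """Split a `width`-bit integer into 24-bit limbs, least significant first."""
    count = -(-width // LIMB_BITS)  # ceiling division
    return [(value >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def multiply_significands(a, b, width):
    """Multiply two significands using only 24x24-bit partial products."""
    total = 0
    for i, limb_a in enumerate(split_limbs(a, width)):
        for j, limb_b in enumerate(split_limbs(b, width)):
            # Each limb_a * limb_b fits one 24x24 multiplier; the shift
            # places the partial product at its positional weight.
            total += (limb_a * limb_b) << (LIMB_BITS * (i + j))
    return total


# A 53-bit double precision significand needs three 24-bit limbs.
double_a = (1 << 52) | 12345
double_b = (1 << 52) | 67890
double_product = multiply_significands(double_a, double_b, width=53)
```

    In this plain decomposition the top limb of a 53-bit significand carries only 5 useful bits, which suggests why a narrower companion multiplier could avoid wasting a full 24x24 block on it.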

  17. ASIP Approach for Multimedia Applications Based on a Scalable VLIW DSP Architecture

    Institute of Scientific and Technical Information of China (English)

    ZHANG Yanjun; HE Hu; SHEN Zheng; SUN Yihe

    2009-01-01

    The rapid development of multimedia techniques has increased the demands on multimedia processors. This paper presents a new design method to quickly design high performance processors for new multimedia applications. In this approach, a configurable processor based on the very long instruction word (VLIW) architecture is used as the basic core for designers to easily configure new processor cores for multimedia algorithms. Specific instructions designed for multimedia applications efficiently improve the performance of the target processor. Functions not implemented in the digital signal processor (DSP) core can be easily integrated into the target processor as user-defined hardware to increase the performance. Several examples are given based on the architecture. The results show that the processor performance is enhanced approximately 4 times on the H.263 codec and that the processor outperforms both DSPs and single instruction multiple data (SIMD) multimedia extension architectures by up to 8 times when computing the 2-D IDCT.

  18. Processor Cache

    NARCIS (Netherlands)

    Boncz, P.A.; Liu, L.; Özsu, M. Tamer

    2008-01-01

    To hide the high latencies of DRAM access, modern computer architecture now features a memory hierarchy that, besides DRAM, also includes SRAM cache memories, typically located on the CPU chip. Memory accesses first check these caches, which takes only a few cycles. Only if the needed data is not found,

  19. Taxonomy of Data Prefetching for Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    Surendra Byna; Yong Chen; Xian-He Sun

    2009-01-01

    Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into the mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.

  20. Architecture Design and Performance Analysis of Supervisory Control System of Multiple UAVs

    Directory of Open Access Journals (Sweden)

    Guozhong Zhang

    2015-04-01

    Although UAV systems are currently controlled by a group of people, in the future, increased automation could allow a single operator to supervise multiple UAVs. Operators will be involved in mission planning, imagery analysis, weapon control, and contingency interventions. This study examines the architecture and prototype of a supervisory control system for multiple UAVs. Firstly, the architecture for testing and evaluating a human supervisory system controlling multiple UAVs is devised and each sub-system is described in detail. Then a prototype test bed of multiple-UAV supervisory control for demonstrating the architecture and adaptive levels of autonomy is built. Finally, with the test bed, the impact of dynamic role allocation on system performance is studied based on quantitative criteria of wait times and operator utilisation. It is shown by simulation that dynamic role allocation can effectively shorten wait times, and eventually improve the system performance. Defence Science Journal, Vol. 65, No. 2, March 2015, pp.93-98, DOI:http://dx.doi.org/10.14429/dsj.65.5837

  1. Verilog Implementation of 32-Bit CISC Processor

    Directory of Open Access Journals (Sweden)

    P.Kanaka Sirisha

    2016-04-01

    The project deals with the design of a 32-bit CISC processor and the modeling of its components using the Verilog language. The entire processor uses a 32-bit bus to deal with all the registers and the memories. The processor implements various arithmetic, logical, and data transfer operations using variable length instructions, which is the core property of the CISC architecture. The processor also supports various addressing modes to perform a 32-bit instruction. Our processor uses the Harvard architecture (i.e., it has separate program and data memories) and hence has different buses to communicate with the program memory and data memory individually. This feature enhances the speed of our processor. Hence it has two different program counters to point to the memory locations of the program memory and data memory. Our processor has "instruction queuing", which enables it to save the time needed to fetch the instruction and hence increases the speed of operation. An "interrupt service routine" is provided in our processor to make it address the interrupts.

  2. Matrix multiplication operations with data pre-conditioning in a high performance computing architecture

    Science.gov (United States)

    Eichenberger, Alexandre E; Gschwind, Michael K; Gunnels, John A

    2013-11-05

    Mechanisms for performing matrix multiplication operations with data pre-conditioning in a high performance computing architecture are provided. A vector load operation is performed to load a first vector operand of the matrix multiplication operation to a first target vector register. A load and splat operation is performed to load an element of a second vector operand and replicate the element to each of a plurality of elements of a second target vector register. A multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the matrix multiplication operation. The partial product of the matrix multiplication operation is accumulated with other partial products of the matrix multiplication operation.
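
    The sequence of operations in the abstract above (vector load, load and splat, multiply add, accumulate) can be mimicked in software. The sketch below models 4-element vector registers as Python lists and assumes column-major storage of 4x4 matrices; the register width, the storage layout, and all function names are assumptions for illustration rather than details from the patent.

```python
VLEN = 4  # assumed number of elements per vector register


def vector_load(memory, offset):
    """Load VLEN consecutive elements into a vector register."""
    return memory[offset:offset + VLEN]


def load_and_splat(memory, offset):
    """Load one element and replicate it across a vector register."""
    return [memory[offset]] * VLEN


def multiply_add(a, b, acc):
    """Element-wise a * b + acc."""
    return [x * y + z for x, y, z in zip(a, b, acc)]


def matmul_4x4(a, b):
    """C = A * B for 4x4 matrices stored column-major as flat lists."""
    c = []
    for j in range(VLEN):              # one output column at a time
        acc = [0] * VLEN
        for k in range(VLEN):
            a_col = vector_load(a, k * VLEN)         # column k of A
            b_kj = load_and_splat(b, j * VLEN + k)   # B[k][j], replicated
            acc = multiply_add(a_col, b_kj, acc)     # accumulate partial product
        c.extend(acc)
    return c


IDENTITY = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
unchanged = matmul_4x4(IDENTITY, list(range(16)))
```

    Each output column is a sum of columns of A scaled by single elements of B, so every inner step maps onto one multiply add over whole registers with no horizontal reduction.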

  3. Phase-Synchronization Early Epileptic Seizure Detector VLSI Architecture.

    Science.gov (United States)

    Abdelhalim, K; Smolyakov, V; Genov, R

    2011-10-01

    A low-power VLSI processor architecture that computes in real time the magnitude and phase-synchronization of two input neural signals is presented. The processor is a part of an envisioned closed-loop implantable microsystem for adaptive neural stimulation. The architecture uses three CORDIC processing cores that require shift-and-add operations but no multiplication. The 10-bit processor synthesized and prototyped in a standard 1.2 V 0.13 μm CMOS technology utilizes 41,000 logic gates. It dissipates 3.6 μW per input pair, and provides 1.7 kS/s per-channel throughput when clocked at 2.5 MHz. The power scales linearly with the number of input channels or the sampling rate. The efficacy of the processor in early epileptic seizure detection is validated on human intracranial EEG data.
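
    CORDIC, named in the abstract above, obtains the magnitude and phase of a complex sample using only shifts and additions. The following fixed-point sketch of CORDIC in vectoring mode is generic: the 16 iterations, the 16 fractional bits, and the restriction to inputs with a positive real part are assumptions, and it does not reproduce the paper's three-core arrangement.

```python
import math

ITERATIONS = 16   # assumed iteration count
FRAC_BITS = 16    # assumed fixed-point fractional bits
SCALE = 1 << FRAC_BITS

# Precomputed rotation angles atan(2^-i) in fixed point (a small lookup table).
ANGLES = [round(math.atan(2.0 ** -i) * SCALE) for i in range(ITERATIONS)]

# Constant gain introduced by the un-normalized micro-rotations.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Rotate (x, y) onto the x axis; return (scaled magnitude, angle).

    Inputs and outputs are fixed-point integers; x must be positive.
    Only shifts, additions, and subtractions are used in the loop.
    """
    z = 0
    for i in range(ITERATIONS):
        if y >= 0:
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
        else:
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
    return x, z


def magnitude_and_phase(re, im):
    """Float wrapper: returns (magnitude, phase in radians)."""
    x, z = cordic_vectoring(round(re * SCALE), round(im * SCALE))
    return x / GAIN / SCALE, z / SCALE


mag, phase = magnitude_and_phase(3.0, 4.0)
```

    For the input (3, 4) the result is close to a magnitude of 5 and a phase of atan2(4, 3). The division by `GAIN` is done in floating point here for readability; a hardware design would fold that constant into a later stage.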

  4. Generating and executing programs for a floating point single instruction multiple data instruction set architecture

    Science.gov (United States)

    Gschwind, Michael K

    2013-04-16

    Mechanisms for generating and executing programs for a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA) are provided. A computer program product comprising a computer recordable medium having a computer readable program recorded thereon is provided. The computer readable program, when executed on a computing device, causes the computing device to receive one or more instructions and execute the one or more instructions using logic in an execution unit of the computing device. The logic implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA), based on data stored in a vector register file of the computing device. The vector register file is configured to store both scalar and floating point values as vectors having a plurality of vector elements.

  5. A Loosely Coupled Control Architecture Based on Agent and CORBA for Multiple Robots

    Institute of Scientific and Technical Information of China (English)

    Wu Shandong(吴山东); Chen Yimin; He Yongyi

    2003-01-01

    With the rapid development of information technology, adopting advanced distributed computing technology to construct robot control systems is gradually becoming an effective approach. This paper proposes a distributed, loosely coupled software architecture based on Agent and CORBA to control multiple robots. This model provides the robot user with agent control units at the semantic level, and CORBA provides function interfaces to agents at the syntax level, which shows good adaptability, flexibility and transparency.

  6. Median and Morphological Specialized Processors for a Real-Time Image Data Processing

    Directory of Open Access Journals (Sweden)

    Kazimierz Wiatr

    2002-01-01

    This paper presents considerations on selecting a multiprocessor MISD architecture for fast implementation of vision image processing. Drawing on the author's earlier experience with real-time systems, the implementation of specialized hardware processors based on programmable FPGA systems is proposed in a pipeline architecture. In particular, the following processors are presented: a median filter and a morphological processor. The structure of a universal reconfigurable processor is proposed as well. Experimental results are presented as delays of the LCA-level implementation for the median filter, morphological processor, convolution processor, look-up-table processor, logic processor and histogram processor. These times are compared with the delays of a general purpose processor and a DSP processor.
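
    The median filter and the basic morphological operations named in the abstract above share one structure: each output pixel is a rank statistic of its neighborhood. A minimal software sketch follows, assuming a 3x3 window and edge pixels replicated at the border; these choices and the function names are illustrative and are not taken from the paper's FPGA design.

```python
def neighborhood(image, row, col):
    """Return the 3x3 window around a pixel, clamping at the image border."""
    rows, cols = len(image), len(image[0])
    return [image[min(max(row + dr, 0), rows - 1)][min(max(col + dc, 0), cols - 1)]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def filter_3x3(image, reducer):
    """Apply a window reducer (median, min, max) to every pixel."""
    return [[reducer(neighborhood(image, r, c)) for c in range(len(image[0]))]
            for r in range(len(image))]


def median(values):
    """Middle element of the sorted window (9 values for a 3x3 window)."""
    return sorted(values)[len(values) // 2]


def median_filter(image):
    return filter_3x3(image, median)


def erode(image):
    return filter_3x3(image, min)   # grayscale erosion, flat 3x3 element


def dilate(image):
    return filter_3x3(image, max)   # grayscale dilation, flat 3x3 element


# A single bright outlier in a flat region is removed by the median filter.
noisy = [[10, 10, 10], [10, 255, 10], [10, 10, 10]]
cleaned = median_filter(noisy)
```

    Since only the reducer differs between the three filters, a pipelined hardware version can share the window-forming stage and swap the rank-selection stage.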

  7. A lock circuit for a multi-core processor

    DEFF Research Database (Denmark)

    2015-01-01

    An integrated circuit comprising multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores that...

  8. RASSP signal processing architectures

    Science.gov (United States)

    Shirley, Fred; Bassett, Bob; Letellier, J. P.

    1995-06-01

    The rapid prototyping of application specific signal processors (RASSP) program is an ARPA/tri-service effort to dramatically improve the process by which complex digital systems, particularly embedded signal processors, are specified, designed, documented, manufactured, and supported. The domain of embedded signal processing was chosen because it is important to a variety of military and commercial applications as well as for the challenge it presents in terms of complexity and performance demands. The principal effort is being performed by two major contractors, Lockheed Sanders (Nashua, NH) and Martin Marietta (Camden, NJ). For both, improvements in methodology are to be exercised and refined through the performance of individual 'Demonstration' efforts. The Lockheed Sanders Demonstration effort is to develop an infrared search and track (IRST) processor. In addition, both contractors' results are being measured by a series of externally administered (by Lincoln Labs) six-month Benchmark programs that measure process improvement as a function of time. The first two Benchmark programs are designing and implementing a synthetic aperture radar (SAR) processor. Our demonstration team is using commercially available VME modules from Mercury Computer to assemble a multiprocessor system scalable from one to hundreds of Intel i860 microprocessors. Custom modules for the sensor interface and display driver are also being developed. This system implements either proprietary or Navy owned algorithms to perform the compute-intensive IRST function in real time in an avionics environment. Our Benchmark team is designing custom modules using commercially available processor chip sets, communication submodules, and reconfigurable logic devices. One of the modules contains multiple vector processors optimized for fast Fourier transform processing. Another module is a fiberoptic interface that accepts high-rate input data from the sensors and provides video-rate output data to a

  9. Pipelining and bypassing in a RISC/DSP processor

    Science.gov (United States)

    Yu, Guojun; Yao, Qingdong; Liu, Peng; Jiang, Zhidi; Li, Fuping

    2005-03-01

    This paper proposes a pipelining and bypassing unit (BPU) design method for our 32-bit RISC/DSP processor, MediaDsp3201 (MD32 for short). MD32 is realized in 0.18 μm technology at 1.8 V with a 200 MHz working clock and can achieve 200 million multiply-accumulate (MAC) operations per second. It thoroughly merges RISC architecture and DSP computation capability, providing fundamental RISC, extended DSP and single instruction multiple data (SIMD) instruction sets with various addressing modes in a unified and customized DSP pipeline stage architecture. We first describe the pipeline structure of MD32, comparing it to a typical RISC-style pipeline structure. We then study the validity of two bypassing schemes in terms of their effectiveness in resolving pipeline data hazards: the centralized and distributed BPU design strategies (CBPU and DBPU). A bypassing circuit chain model is given for the DBPU, in which register read is placed only at the ID pipe stage. Considering that the processor's working clock is decided by the pipeline time delay, the optimization of the circuit that performs serial selection with priority is also analyzed in detail, since the BPU consists of a long serial path of combinational logic. Finally, the performance improvement is analyzed.
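
    Bypassing (also called forwarding) resolves a pipeline data hazard by handing a consumer instruction a result that has not yet been written back to the register file. The "serial select with priority" mentioned in the abstract above can be pictured as a chain that checks the youngest in-flight result first. The sketch below is a behavioral illustration only; the data structures and the function `read_operand` are hypothetical and do not describe the MD32 circuit.

```python
def read_operand(reg, regfile, in_flight):
    """Return the freshest value of a register for an instruction in decode.

    `in_flight` lists (destination_register, value) pairs for results still
    in the pipeline, ordered youngest first. The first match wins, which
    models a priority chain: a newer write overrides any older one.
    """
    for dest, value in in_flight:
        if dest == reg:
            return value          # bypass: take the value from the pipeline
    return regfile[reg]           # no hazard: read the register file


# r1 has an old value of 10 in the register file, but two newer writes to
# r1 are still in the pipeline; the youngest (99) must be selected.
regfile = {1: 10, 2: 20}
in_flight = [(1, 99), (1, 55)]
forwarded = read_operand(1, regfile, in_flight)
```

    The loop is serial by construction, which mirrors why such a chain can sit on the critical path and limit the clock rate in hardware.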

  10. On-board neural processor design for intelligent multisensor microspacecraft

    Science.gov (United States)

    Fang, Wai-Chi; Sheu, Bing J.; Wall, James

    1996-03-01

    A compact VLSI neural processor based on the Optimization Cellular Neural Network (OCNN) has been under development to provide a wide range of support for an intelligent remote sensing microspacecraft which requires both high-bandwidth communication and high-performance computing for on-board data analysis, thematic data reduction, synergy of multiple types of sensors, and other advanced smart-sensor functions. The OCNN is developed with emphasis on its capability to find globally optimal solutions by using a hardware annealing method. The hardware annealing function is embedded in the network. It is a parallel version of fast mean-field annealing in analog networks, and is highly efficient in finding globally optimal solutions for cellular neural networks. The OCNN is designed to perform programmable functions for fine-grained processing with annealing control to enhance the output quality. The OCNN architecture is a programmable multi-dimensional array of neurons which are locally connected with their neighboring neurons. Major design features of the OCNN neural processor include massively parallel neural processing, hardware annealing capability, a winner-take-all mechanism, digitally programmable synaptic weights, and a multisensor parallel interface. The feasibility of a compact current-mode VLSI design of the OCNN neural processor is demonstrated by a prototype 5 x 5 neuroprocessor array chip in a 2-micrometer CMOS technology. The OCNN operation theory, architecture, design and implementation, prototype chip, and system applications have been investigated in detail and are presented in this paper.

  11. Computer organization, design, and architecture

    CERN Document Server

    Shiva, Sajjan G

    2007-01-01

    Suitable for a one- or two-semester undergraduate or beginning graduate course in computer science and computer engineering, Computer Organization, Design, and Architecture, Fourth Edition presents the operating principles, capabilities, and limitations of digital computers to enable development of complex yet efficient systems. With 40% updated material and four new chapters, this edition takes students through a solid, up-to-date exploration of single- and multiple-processor systems, embedded architectures, and performance evaluation. New to the Fourth Edition Additional material that cove

  12. Speedup bioinformatics applications on multicore-based processor using vectorizing and multithreading strategies.

    Science.gov (United States)

    Chaichoompu, Kridsadakorn; Kittitornkun, Surin; Tongsima, Sissades

    2007-12-30

    Many computationally intensive bioinformatics programs written in C/C++, such as those for multiple sequence alignment and population structure analysis, are not multicore-aware. A multicore processor is an emerging CPU technology that combines two or more independent processors into a single package. The Single Instruction Multiple Data-stream (SIMD) paradigm is heavily utilized in this class of processors. Nevertheless, most popular compilers, including Microsoft Visual C/C++ 6.0 and the x86 GNU C compiler (gcc), do not automatically create SIMD code that can fully utilize the advances of these processors. To harness the power of the new multicore architecture, certain compiler techniques must be considered. This paper presents a generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++. The proposed framework contains two main steps: multithreading and vectorizing strategies. After following the strategies, the application can achieve higher speedup by taking advantage of multicore architecture technology. Because of the extremely fast interconnection among multiple cores, it is suggested that the proposed optimization could be more appropriate than parallelization on a small cluster computer, which has larger network latency and lower bandwidth.

  13. Fine Surveying and 3D Modeling Approach for Wooden Ancient Architecture via Multiple Laser Scanner Integration

    Directory of Open Access Journals (Sweden)

    Qingwu Hu

    2016-03-01

    A multiple terrestrial laser scanner (TLS) integration approach is proposed for the fine surveying and 3D modeling of ancient wooden architecture in an ancient building complex of the Wudang Mountains, which is located in very steep surroundings that make it difficult to access. Three-level TLS with a scalable measurement distance and accuracy is presented for data collection to compensate for data missed because of mutual sheltering and scanning view limitations. A multi-scale data fusion approach is proposed for data registration and filtering of the different scales and separated 3D data. A point projection algorithm together with point cloud slice tools is designed for fine surveying to generate all types of architecture maps, such as plan drawings, facade drawings, section drawings, and door and window drawings. The section drawings together with the sliced point cloud are presented for the deformation analysis of the building structure. Along with fine drawings and laser scanning data, the 3D models of the ancient architecture components are built for digital management and visualization. Results show that the proposed approach can achieve fine surveying and 3D documentation of the ancient architecture within 3 mm accuracy. In addition, the defects of scanning view and mutual sheltering can be overcome to obtain the complete and exact structure in detail.

  14. Evaluating current processors performance and machines stability

    CERN Document Server

    Esposito, R; Tortone, G; Taurino, F M

    2003-01-01

    Accurately estimating the performance of currently available processors is becoming a key activity, particularly in HENP environments, where high computing power is crucial. This document describes the methods and programs, open source or freeware, used to benchmark processors, memory and disk subsystems, and network connection architectures. These tools are also useful for stress testing new machines before their acquisition or before their introduction into a production environment, where high uptimes are required.

  15. Libera Electron Beam Position Processor

    CERN Document Server

    Ursic, Rok

    2005-01-01

    Libera is a product family delivering unprecedented possibilities for either building powerful single station solutions or architecting complex feedback systems in the field of accelerator instrumentation and controls. This paper presents the functionality and field performance of its first member, the electron beam position processor. It offers superior performance with multiple measurement channels simultaneously delivering position measurements in digital format with MHz, kHz and Hz bandwidths. This all-in-one product, facilitating pulsed and CW measurements, is much more than simply a high performance beam position measuring device delivering micrometer level reproducibility with sub-micrometer resolution. Rich connectivity options and innate processing power make it a powerful feedback building block. By interconnecting multiple Libera electron beam position processors one can build a low-latency high throughput orbit feedback system without adding hardware. Libera electron beam position processor ...

  16. Parallel architecture for rapid image generation and analysis

    Energy Technology Data Exchange (ETDEWEB)

    Nerheim, R.J.

    1987-01-01

    A multiprocessor architecture inspired by the Disney multiplane camera is proposed. For many applications, this approach produces a natural mapping of processors to objects in a scene. Such a mapping promotes parallelism and reduces the hidden-surface work with minimal interprocessor communication and low-overhead cost. Existing graphics architectures store the final picture as a monolithic entity. The architecture here stores each object's image separately. It assembles the final composite picture from component images only when the video display needs to be refreshed. This organization simplifies the work required to animate moving objects that occlude other objects. In addition, the architecture has multiple processors that generate the component images in parallel. This further shortens the time needed to create a composite picture. In addition to generating images for animation, the architecture has the ability to decompose images.
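
    The organization described above (each object's image stored separately, with the composite assembled only at refresh time) can be illustrated with a minimal sketch. It is not the author's design: the `Layer` class, the one-row "image", the depth convention and the use of a space as the transparent pixel are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One object's image, kept separate from the final picture (hypothetical type)."""
    depth: int    # assumed convention: larger depth = farther from the viewer
    x: int        # column offset of the layer within the frame
    pixels: str   # a single row of pixels; ' ' marks a transparent pixel


def composite(layers, width, background="."):
    """Assemble one frame row from the component images, farthest layer first."""
    frame = [background] * width
    for layer in sorted(layers, key=lambda l: l.depth, reverse=True):
        for i, p in enumerate(layer.pixels):
            col = layer.x + i
            if p != " " and 0 <= col < width:
                frame[col] = p  # nearer layers overwrite farther ones
    return "".join(frame)


tree = Layer(depth=2, x=2, pixels="TTT")
bird = Layer(depth=1, x=0, pixels="B")

frame_a = composite([tree, bird], width=8)
bird.x = 3  # animate: only the moving object's layer changes
frame_b = composite([tree, bird], width=8)
```

    In this sketch, moving the bird in front of the tree needs no hidden-surface work on the tree's stored image; occlusion is resolved when the frame is reassembled.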

  17. Enabling Future Robotic Missions with Multicore Processors

    Science.gov (United States)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  18. A Complete Multi-Processor System-on-Chip FPGA-Based Emulation Framework

    OpenAIRE

    Valle, Del; Pablo, G.; Atienza, David; Magan, Ivan; Flores, Javier G.; Perez, Esther A.; Mendias, Jose M.; Benini, Luca; De Micheli, Giovanni

    2006-01-01

    With the growing complexity in consumer embedded products and the improvements in process technology, Multi-Processor System-On-Chip (MPSoC) architectures have become widespread. These new systems are very complex to design as they must execute multiple complex real-time applications (e.g. video processing, or videogames), while meeting several additional design constraints (e.g. energy consumption or time-to-market). Therefore, mechanisms to efficiently explore the different possible HW-SW d...

  19. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  20. Real-time Parallel Processing System Design and Implementation for Underwater Acoustic Communication Based on Multiple Processors

    Institute of Scientific and Technical Information of China (English)

    YAN Zhen-hua; HUANG Jian-guo; ZHANG Qun-fei; HE Cheng-bing

    2007-01-01

    The ADSP-TS101 is a high-performance DSP well suited to high-speed parallel processing. To meet the real-time processing requirements of underwater acoustic communication algorithms, a real-time parallel processing system with multi-channel synchronous sampling, composed of multiple ADSP-TS101s, is designed and implemented. In the hardware design, field programmable gate array (FPGA) logic control is adopted for the multi-channel synchronous sampling module, and a cluster/data flow associated pin connection mode is adopted for the multiprocessor parallel processing configuration. The software is optimized using two kinds of communication: broadcast writes over the shared bus and point-to-point transfers over link ports. Whole-system installation, integration debugging and experiments in a lake show that the real-time parallel processing system has good stability and real-time processing capability and meets the technical design requirements for real-time processing.

  1. The Telesupervised Adaptive Ocean Sensor Fleet (TAOSF) Architecture: Coordination of Multiple Oceanic Robot Boats

    Science.gov (United States)

    Elfes, Alberto; Podnar, Gregg W.; Dolan, John M.; Stancliff, Stephen; Lin, Ellie; Hosler, Jeffrey C.; Ames, Troy J.; Higinbotham, John; Moisan, John R.; Moisan, Tiffany A.; Kulczycki, Eric A.

    2008-01-01

    Earth science research must bridge the gap between the atmosphere and the ocean to foster understanding of Earth's climate and ecology. Ocean sensing is typically done with satellites, buoys, and crewed research ships. The limitations of these systems include the fact that satellites are often blocked by cloud cover, and buoys and ships have spatial coverage limitations. This paper describes a multi-robot science exploration software architecture and system called the Telesupervised Adaptive Ocean Sensor Fleet (TAOSF). TAOSF supervises and coordinates a group of robotic boats, the OASIS platforms, to enable in-situ study of phenomena in the ocean/atmosphere interface, as well as on the ocean surface and sub-surface. The OASIS platforms are extended deployment autonomous ocean surface vehicles, whose development is funded separately by the National Oceanic and Atmospheric Administration (NOAA). TAOSF allows a human operator to effectively supervise and coordinate multiple robotic assets using a sliding autonomy control architecture, where the operating mode of the vessels ranges from autonomous control to teleoperated human control. TAOSF increases data-gathering effectiveness and science return while reducing demands on scientists for robotic asset tasking, control, and monitoring. The first field application chosen for TAOSF is the characterization of Harmful Algal Blooms (HABs). We discuss the overall TAOSF architecture, describe field tests conducted under controlled conditions using rhodamine dye as a HAB simulant, present initial results from these tests, and outline the next steps in the development of TAOSF.

  2. Inter Processor Communication for Fault Diagnosis in Multiprocessor Systems

    Directory of Open Access Journals (Sweden)

    C. D. Malleswar

    1994-04-01

    In the present paper a simple technique is proposed for fault diagnosis in multiprocessor and multiple system environments, wherein all microprocessors in the system are used in part to check the health of their neighbouring processors. It involves building simple fail-safe serial communication links between processors. Processors communicate with each other over these links, and each processor is made to go through certain sequences of actions intended for diagnosis, under the observation of another processor. Fault detection can be done by this method with limited overhead. Also outlined are some of the popular techniques used for health checks of processor-based systems.
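
    A minimal sketch of the neighbour-checking idea follows. It is an illustration under stated assumptions, not the paper's protocol: processors are modelled as plain functions, the serial link as a function call, the ring topology, the test sequence (`a * b + a`) and the names `diagnose_ring`, `healthy`, `stuck_alu` and `dead` are all hypothetical.

```python
def diagnose_ring(processors, challenge=(3, 4)):
    """Each processor i observes its right-hand neighbour running a known
    test sequence and compares the reply with the expected answer.
    Returns the indices of neighbours whose replies were wrong or missing."""
    a, b = challenge
    expected = a * b + a          # assumed diagnostic computation
    suspects = []
    n = len(processors)
    for i in range(n):
        j = (i + 1) % n           # neighbour under observation
        try:
            reply = processors[j](a, b)
        except Exception:
            reply = None          # no answer over the link counts as a failure
        if reply != expected:
            suspects.append(j)
    return suspects


def healthy(a, b):
    return a * b + a


def stuck_alu(a, b):
    return 0                      # hypothetical fault: result stuck at zero


def dead(a, b):
    raise TimeoutError("no reply on serial link")


ring = [healthy, stuck_alu, healthy, dead]
suspects = diagnose_ring(ring)
```

    The sketch runs the comparison centrally for brevity; in a real system the observing processor may itself be faulty, so its verdict would need cross-checking by another neighbour.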

  3. Floating point only SIMD instruction set architecture including compare, select, Boolean, and alignment operations

    Science.gov (United States)

    Gschwind, Michael K.

    2011-03-01

    Mechanisms for implementing a floating point only single instruction multiple data instruction set architecture are provided. A processor is provided that comprises an issue unit, an execution unit coupled to the issue unit, and a vector register file coupled to the execution unit. The execution unit has logic that implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA). The floating point vector registers of the vector register file store both scalar and floating point values as vectors having a plurality of vector elements. The processor may be part of a data processing system.

  4. VLSI architectures for computing multiplications and inverses in GF(2^m)

    Science.gov (United States)

    Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.; Reed, I. S.

    1983-01-01

    Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2^m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
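
    The link between squaring and inversion can be sketched in software. In GF(2^m) every nonzero element satisfies a^(2^m - 1) = 1, so the inverse is a^(2^m - 2) = a^2 * a^4 * ... * a^(2^(m-1)), a chain of squarings and multiplications. The sketch below is not the Massey-Omura design: it uses a polynomial basis (where squaring is an ordinary multiplication) instead of a normal basis (where squaring is a cyclic shift), and the field size m = 4 and the modulus x^4 + x + 1 are assumptions chosen to keep the example small.

```python
M = 4
POLY = 0b10011  # x^4 + x + 1, irreducible over GF(2) (assumed example modulus)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M): shift-and-add with modular reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a           # addition in GF(2^M) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY             # reduce when the degree reaches M
    return result


def gf_inv(a):
    """Inverse as a^(2^M - 2) = a^2 * a^4 * ... * a^(2^(M-1))."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^M)")
    result = 1
    power = a
    for _ in range(M - 1):
        power = gf_mul(power, power)    # squaring step
        result = gf_mul(result, power)  # accumulate the product
    return result


inverse_of_x = gf_inv(0b0010)  # the element x
```

    The loop needs M - 1 squarings and M - 1 multiplications; with a normal basis each squaring would reduce to a bit rotation, which is what makes the approach attractive in hardware.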

  5. VLSI architectures for computing multiplications and inverses in GF(2^m)

    Science.gov (United States)

    Wang, C. C.; Truong, T. K.; Shao, H. M.; Deutsch, L. J.; Omura, J. K.

    1985-01-01

    Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that are easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. A pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m). With the simple squaring property of the normal-basis representation used together with this multiplier, a pipeline architecture is also developed for computing inverse elements in GF(2^m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.

  6. Multiple Bit Error Tolerant Galois Field Architectures Over GF(2^m)

    Directory of Open Access Journals (Sweden)

    Mahesh Poolakkaparambil

    2012-06-01

    Radiation-induced transient faults such as single event upsets (SEU) and multiple event upsets (MEU) in memories are well researched. As a result of technology scaling, it is observed that logic blocks are also vulnerable to malfunction when they are deployed in radiation-prone environments. However, the current literature lacks efforts to mitigate such issues in digital logic circuits when they are exposed to natural radiation-prone environments or subjected to malicious attacks by an eavesdropper using highly energized particles. This may lead to catastrophe in critical applications such as widely used cryptographic hardware. In this paper, novel dynamic error correction architectures, based on BCH codes, are proposed for correcting multiple errors, which makes the circuits robust against radiation-induced faults irrespective of the location of the errors. As a benchmark test case, the finite field multiplier circuit is considered as the functional block which can be the target for major attacks. The proposed scheme has the capability to handle stuck-at faults, which are also a major cause of failure affecting the overall yield of a nano-CMOS integrated chip. The experimental results show that the proposed dynamic error detection and correction architecture results in a 50% reduction in critical path delay by dynamically bypassing the error correction logic when no error is present. The area overhead for the larger multiplier is within 150%, which is 33% lower than that of TMR and comparable to the 130% overhead of single error correcting Hamming and LDPC based techniques.

  7. Impact of the Use of Object Request Broker Middleware for Inter-Component Communications in C6416 Digital Signal Processor Based Software Communications Architecture Radio Systems

    Directory of Open Access Journals (Sweden)

    Mohamed I. Yousef

    2012-01-01

    Problem statement: This study presents an in-depth analysis of the performance of Software Communications Architecture (SCA) component-based waveform applications in terms of inter-component communications. The main limitation of SCA, in the context of embedded systems, is the additional cost introduced by the use of Object Request Broker (ORB) middleware. The ORB middleware handles the interaction between components and objects in the SCA distributed environment. This interaction should be highly efficient, because of the real-time nature of SCA systems, and transparent to the application programmer. Approach: We can achieve high efficiency in SCA systems by enhancing the Inter-Process Communications (IPC) mechanisms in Operating System (OS) microkernels, while we achieve transparency through the Interface Definition Language (IDL). Different encoding mechanisms such as External Data Representation (XDR), Network Data Representation (NDR) and Common Data Representation (CDR) facilitate inter-component communication transparently and efficiently. Marshalling procedures format data from the local machine representation to common network representations. The most common encoding mechanism for Common Object Request Broker Architecture (CORBA) systems is the CDR representation. Measurements have been performed with ORBExpress DSP as the CORBA distribution and Open Source SCA Implementation Embedded (OSSIE) as the SCA implementation. To perform these measurements we proposed two metrics for profiling the ORB: invocation and marshalling. In addition, we propose three kinds of data types to evaluate the performance of ORB middleware: Basic, Array and Sequence data types. Results: The CORBA bus is the part that brings an overhead to SCA radio systems. This overhead is due to method invocations carried out by the ORB middleware. Conclusion: Performance benchmarks of ORBExpress DSP middleware show that, although using CORBA for

  8. Processor-Dependent Malware... and codes

    CERN Document Server

    Desnos, Anthony; Filiol, Eric

    2010-01-01

    Malware usually targets computers according to their operating system. Thus we have Windows malware, Linux malware and so on ... In this paper, we consider a different approach and show on a technical basis how easily malware can recognize and target systems selectively, according to the onboard processor chip. This technology is very easy to build since it does not rely on deep analysis of the chip's logic gate architecture. Floating Point Arithmetic (FPA) looks promising for defining a set of tests to identify the processor or, more precisely, a subset of possible processors. We give results for different families of processors: AMD, Intel (Dual Core, Atom), Sparc, Digital Alpha, Cell, Atom ... As a conclusion, we propose two open problems that are new, to the authors' knowledge.

  9. High speed matrix processors using floating point representation

    Energy Technology Data Exchange (ETDEWEB)

    Birkner, D.A.

    1980-01-01

    The author describes the architecture of a high-speed matrix processor which uses a floating-point format for data representation. It is shown how multipliers and other LSI devices are used in the design to obtain the high speed of the processor.

  10. A Simple and Affordable TTL Processor for the Classroom

    Science.gov (United States)

    Feinberg, Dave

    2007-01-01

    This paper presents a simple 4 bit computer processor design that may be built using TTL chips for less than $65. In addition to describing the processor itself in detail, we discuss our experience using the laboratory kit and its associated machine instruction set to teach computer architecture to high school students. (Contains 3 figures and 5…

  11. Open|SpeedShop Ease of Use Performance Analysis for Heterogeneous Processor Systems Project

    Data.gov (United States)

    National Aeronautics and Space Administration — We propose building upon the modular extensible architecture and existing capabilities of Open|SpeedShop to provide seamless, integrated, heterogeneous processor...

  12. Real-Time Signal Processor for Pulsar Studies

    Indian Academy of Sciences (India)

    P. S. Ramkumar; A. A. Deshpande

    2001-12-01

    This paper describes the design, tests and preliminary results of a real-time parallel signal processor built to aid a wide variety of pulsar observations. The signal processor reduces the distortions caused by the effects of dispersion, Faraday rotation, Doppler acceleration and parallactic angle variations, at a sustained data rate of 32 Msamples/sec. It also folds the pulses coherently over the period and integrates adjacent samples in time and frequency to enhance the signal-to-noise ratio. The resulting data are recorded for further off-line analysis of the characteristics of pulsars and the intervening medium. The signal processing for analysis of pulsar signals is quite complex, imposing the need for a high computational throughput, typically of the order of one giga-operation per second (GOPS). Conventionally, the high computational demand restricts the flexibility to handle only a few types of pulsar observations. This instrument is designed to handle a wide variety of pulsar observations with the Giant Metrewave Radio Telescope (GMRT), and is flexible enough to be used in many other high-speed signal processing applications. The technology used includes field-programmable gate array (FPGA) based data/code routing interfaces, PC-AT based control, diagnostics and data acquisition, digital signal processor (DSP) chip based parallel processing nodes, and C language based control software and DSP-assembly programs for signal processing. The architecture and the software implementation of the parallel processor are fine-tuned to realize about 60 MOPS per DSP node and a multiple-instruction-multiple-data (MIMD) capability.
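
    Folding, one of the operations named above, can be sketched in a few lines: samples are accumulated into phase bins modulo the pulse period, so the periodic pulse adds up while uncorrelated noise averages out. This is a minimal illustration, not the instrument's DSP code; the function name `fold`, the integer period in samples, and the synthetic test signal with an alternating offset standing in for noise are all assumptions.

```python
def fold(samples, period, nbins):
    """Fold a time series at `period` (in samples) into `nbins` phase bins
    and return the mean value in each bin."""
    sums = [0.0] * nbins
    counts = [0] * nbins
    for n, s in enumerate(samples):
        phase = (n % period) / period      # pulse phase in [0, 1)
        b = int(phase * nbins)
        sums[b] += s
        counts[b] += 1
    return [t / c if c else 0.0 for t, c in zip(sums, counts)]


# Synthetic data: a unit pulse at sample 3 of every 8-sample period, plus an
# offset of +0.5 or -0.5 that alternates from one period to the next.
period = 8
samples = [
    (1.0 if n % period == 3 else 0.0)
    + (0.5 if (n // period) % 2 == 0 else -0.5)
    for n in range(period * 100)
]
profile = fold(samples, period, nbins=8)
peak_bin = profile.index(max(profile))
```

    In a single period the pulse is comparable in size to the offset; after folding 100 periods the offset cancels and the pulse stands out in one phase bin.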

  13. High-resolution real-time imaging processor for airborne SAR

    Science.gov (United States)

    Yu, Weidong; Wu, Shumei

    2003-04-01

    A real-time imaging processor can provide Synthetic Aperture Radar (SAR) images in real time, which is necessary for airborne SAR applications such as real-time monitoring and battle reconnaissance. This paper describes the development of a high-resolution real-time imaging processor at the Institute of Electronics, Chinese Academy of Sciences (IECAS). The processor uses multiple parallel channels to perform the large volume of calculation needed for real-time SAR imaging. A sub-aperture method is used to divide the azimuth Doppler spectrum into two parts, which correspond to two looks. With the sub-aperture method, high processing efficiency, reduced range migration effects and reduced memory volume can be achieved. The imaging swath is also divided into two segments, which are processed in parallel. The range-Doppler algorithm, which consists of range migration correction and azimuth compression, is implemented in the processor. Careful software programming ensures highly efficient use of the hardware. Experimental simulation and field flights indicate that the system is successful. The principles, architecture and hardware implementation of the processor are presented in detail.

  14. 3D graphical visualization of the genetic architectures underlying complex traits in multiple environments

    Institute of Scientific and Technical Information of China (English)

    HU Cheng-cheng; YE Xiu-zi; ZHANG Yin; YU Rong-dong; YANG Jian; ZHU Jun

    2007-01-01

    An approach for generating interactive 3D graphical visualization of the genetic architectures of complex traits in multiple environments is described. 3D graphical visualization is used to improve on traditional plots in quantitative trait locus (QTL) mapping analysis. Interactive 3D graphical visualization for abstract expression of QTL, epistasis and their environmental interactions for experimental populations was developed within the framework of the user-friendly software QTLNetwork (http://ibi.zju.edu.cn/software/qtlnetwork). A novel definition of a graphical meta system and the computation of virtual coordinates are used to achieve explicit but meaningful visualization. Interactive 3D graphical visualization for QTL analysis provides geneticists and breeders with a powerful and easy-to-use tool to analyze and publish their research results.

  15. ET: An Energy-efficient Processor Architecture for Embedded Tera-scale Computing

    Institute of Scientific and Technical Information of China (English)

    杨乾明; 伍楠; 管茂林; 张春元; 全巍; 黄达飞

    2011-01-01

    As standards and algorithms evolve and become more complex, high-end embedded applications demand both high performance and energy efficiency. The challenge, however, is how to turn VLSI capability into actual computing performance. This research proposes an energy-efficient processor architecture named ET (Embedded Tera-scale Computing), which is composed of many lightweight VLIW processor cores, also called small cores. Each core executes a thread with mechanisms for explicitly managing data and instructions. ET uses hierarchical data registers to reduce the cost of delivering data, and asymmetric, distributed instruction registers to deliver the instructions. To further reduce energy, ET employs a shallow pipeline and simple control flow, and optimizes the execution of application loop bodies. Preliminary results show that ET can achieve 1 TOPS performance and 100 GOPS/W efficiency when scaled to 40 nm.

  16. A CNN-Specific Integrated Processor

    Directory of Open Access Journals (Sweden)

    Suleyman Malki

    2009-01-01

    Integrated Processors (IP) are algorithm-specific cores that either by programming or by configuration can be re-used within many microelectronic systems. This paper looks at how Cellular Neural Networks (CNN) can be realized as IP. First, current digital implementations are reviewed, and the memory-processor bandwidth issues are analyzed. Then a generic view is taken on the structure of the network, and a new intra-communication protocol based on rotating wheels is proposed. It is shown that this provides for guaranteed high performance with a minimal network interface. The resulting node is small and supports multi-level CNN designs, giving the system a 30-fold increase in capacity compared to classical designs. As it facilitates multiple operations on a single image, and single operations on multiple images, with minimal access to the external image memory, balancing the internal and external data transfer requirements optimizes the system operation. In conventional digital CNN designs, the treatment of boundary nodes requires additional logic to handle the CNN value propagation scheme. In the new architecture, only a slight modification of the existing cells is necessary to model the boundary effect. A typical prototype for visual pattern recognition will house 4096 CNN cells with a 2% overhead for making it an IP.

  17. A CNN-Specific Integrated Processor

    Science.gov (United States)

    Malki, Suleyman; Spaanenburg, Lambert

    2009-12-01

    Integrated Processors (IP) are algorithm-specific cores that either by programming or by configuration can be re-used within many microelectronic systems. This paper looks at how Cellular Neural Networks (CNN) can be realized as IP. First, current digital implementations are reviewed, and the memory-processor bandwidth issues are analyzed. Then a generic view is taken on the structure of the network, and a new intra-communication protocol based on rotating wheels is proposed. It is shown that this provides for guaranteed high performance with a minimal network interface. The resulting node is small and supports multi-level CNN designs, giving the system a 30-fold increase in capacity compared to classical designs. As it facilitates multiple operations on a single image, and single operations on multiple images, with minimal access to the external image memory, balancing the internal and external data transfer requirements optimizes the system operation. In conventional digital CNN designs, the treatment of boundary nodes requires additional logic to handle the CNN value propagation scheme. In the new architecture, only a slight modification of the existing cells is necessary to model the boundary effect. A typical prototype for visual pattern recognition will house 4096 CNN cells with a 2% overhead for making it an IP.

  18. Adaptive algebraic multigrid on SIMD architectures

    CERN Document Server

    Heybrock, Simon; Georg, Peter; Wettig, Tilo

    2015-01-01

    We present details of our implementation of the Wuppertal adaptive algebraic multigrid code DD-αAMG on SIMD architectures, with particular emphasis on the Intel Xeon Phi processor (KNC) used in QPACE 2. As a smoother, the algorithm uses a domain-decomposition-based solver code previously developed for the KNC in Regensburg. We optimized the remaining parts of the multigrid code and conclude that it is a very good target for SIMD architectures. Some of the remaining bottlenecks can be eliminated by vectorizing over multiple test vectors in the setup, which is discussed in the contribution of Daniel Richtmann.

  19. C-slow Technique vs Multiprocessor in designing Low Area Customized Instruction set Processor for Embedded Applications

    CERN Document Server

    Akram, Muhammad Adeel; Sarfaraz, Muhammad Masood

    2012-01-01

    The demand for high-performance embedded processors for consumer electronics has been increasing rapidly over the past few years. Many of these embedded processors depend on a custom-built Instruction Set Architecture (ISA), such as game processors (GPU), multimedia processors, DSP processors, etc. The primary requirement of the consumer electronics industry is low cost with high performance and low power consumption. Much research has sought to enhance the performance of embedded processors through parallel computing. Some of it focuses on superscalar processors, i.e. single processors with more resources, exploiting Instruction Level Parallelism (ILP), which includes Very Long Instruction Word (VLIW) architecture and custom instruction set extensible processor architecture; other work requires more processing units on a single chip, exploiting Thread Level Parallelism (TLP), which includes Simultaneous Multithreading (SMT), Chip Multithreading (CMT) and Chip Multiprocessing (CMP). In this paper, we present a new technique, n...

  20. APRON: A Cellular Processor Array Simulation and Hardware Design Tool

    Directory of Open Access Journals (Sweden)

    David R. W. Barr

    2009-01-01

    We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.

  1. Embedded Processor Laboratory

    Data.gov (United States)

    Federal Laboratory Consortium — The Embedded Processor Laboratory provides the means to design, develop, fabricate, and test embedded computers for missile guidance electronics systems in support...

  2. Multiple mating but not recombination causes quantitative increase in offspring genetic diversity for varying genetic architectures.

    Directory of Open Access Journals (Sweden)

    Olav Rueppell

    Full Text Available Explaining the evolution of sex and recombination is particularly intriguing for some species of eusocial insects because they display exceptionally high mating frequencies and genomic recombination rates. Explanations for both phenomena are based on the notion that both increase colony genetic diversity, with demonstrated benefits for colony disease resistance and division of labor. However, the relative contributions of mating number and recombination rate to colony genetic diversity have never been simultaneously assessed. Our study simulates colonies, assuming different mating numbers, recombination rates, and genetic architectures, to assess their worker genotypic diversity. The number of loci has a strong negative effect on genotypic diversity when the allelic effects are inversely scaled to locus number. In contrast, dominance, epistasis, lethal effects, or limiting the allelic diversity at each locus does not significantly affect the model outcomes. Mating number increases colony genotypic variance and lowers variation among colonies with quickly diminishing returns. Genomic recombination rate does not affect intra- and inter-colonial genotypic variance, regardless of mating frequency and genetic architecture. Recombination slightly increases the genotypic range of colonies and more strongly the number of workers with unique allele combinations across all loci. Overall, our study contradicts the argument that the exceptionally high recombination rates cause a quantitative increase in offspring genotypic diversity across one generation. Alternative explanations for the evolution of high recombination rates in social insects are therefore needed. Short-term benefits are central to most explanations of the evolution of multiple mating and high recombination rates in social insects but our results also apply to other species.

  3. Array processors in chemistry

    Energy Technology Data Exchange (ETDEWEB)

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  4. A Video Specific Instruction Set Architecture for ASIP design

    Directory of Open Access Journals (Sweden)

    Zheng Shen

    2007-01-01

    Full Text Available This paper describes a novel video specific instruction set architecture for ASIP design. With single instruction multiple data (SIMD) instructions, two destination modes, and video specific instructions, an instruction set architecture is introduced to enhance the performance for video applications. Furthermore, we quantify the improvement on H.263 encoding. In this paper, we evaluate and compare the performance of VS-ISA, other DSPs (digital signal processors), and conventional SIMD media extensions in the context of video coding. Our evaluation results show that VS-ISA improves the processor's performance by approximately 5x on H.263 encoding, and VS-ISA outperforms other architectures by 1.6x to 8.57x in computing IDCT.

  5. Real time processor for array speckle interferometry

    Science.gov (United States)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-01-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.
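
    The processing chain named in this abstract (flat-fielding, a two-dimensional complex FFT, and power-spectrum averaging) can be sketched in software. The following is a minimal illustration only, not the authors' hardware design: the function names, the division-based flat-field correction, and the recursive radix-2 FFT are all assumptions chosen for clarity.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2(frame):
    """2-D FFT: transform every row, then every column."""
    rows = [fft(row) for row in frame]
    cols = [fft(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order

def average_power_spectrum(frames, flat):
    """Flat-field each frame (assumed: divide by a flat image), FFT it,
    and average |F|^2 over all frames."""
    n = len(flat)
    acc = [[0.0] * n for _ in range(n)]
    for frame in frames:
        corrected = [[frame[i][j] / flat[i][j] for j in range(n)] for i in range(n)]
        spectrum = fft2(corrected)
        for i in range(n):
            for j in range(n):
                acc[i][j] += abs(spectrum[i][j]) ** 2
    return [[v / len(frames) for v in row] for row in acc]
```

    The same functions accept 64 x 64 frames; whether a software version could meet the 25 ms budget depends entirely on the platform, which is the motivation for dedicated hardware in the abstract.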

  6. Distributed Processor/Memory Architectures Design Program

    Science.gov (United States)

    1975-02-01

    ...implemented on the memory ... is discussed in Subsection MD.B. The next problem, which to a great extent is of a management-decision nature, is to come up with a sensible

  7. Automated Probabilistic System Architecture Analysis in the Multi-Attribute Prediction Language (MAPL): Iteratively Developed using Multiple Case Studies

    Directory of Open Access Journals (Sweden)

    Robert Lagerström

    2017-07-01

    Full Text Available The Multi-Attribute Prediction Language (MAPL), an analysis metamodel for non-functional qualities of system architectures, is introduced. MAPL features automate analysis in five non-functional areas: service cost, service availability, data accuracy, application coupling, and application size. In addition, MAPL explicitly includes utility modeling to make trade-offs between the qualities. The article introduces how each of the five non-functional qualities is modeled and quantitatively analyzed based on the ArchiMate standard for enterprise architecture modeling and the previously published Predictive, Probabilistic Architecture Modeling Framework, building on the well-known UML and OCL formalisms. The main contribution of MAPL lies in the probabilistic use of multi-attribute utility theory for the trade-off analysis of the non-functional properties. Additionally, MAPL proposes novel model-based analyses of several non-functional attributes. We also report how MAPL has iteratively been developed using multiple case studies.

  8. A Novel Architecture for Adaptive Traffic Control in Network on Chip using Code Division Multiple Access Technique

    Directory of Open Access Journals (Sweden)

    Fatemeh Dehghani

    2016-08-01

    Full Text Available Network on chip has emerged as a long-term and effective method for Multiprocessor System-on-Chip communication, overcoming the bottleneck of bus-based communication architectures. The efficiency and performance of a network on chip depend strongly on the architecture and structure of the network. In this paper, a new structure and architecture for adaptive traffic control in network on chip using the Code Division Multiple Access technique is presented. The code division multiple access technique is applied to solve the problem of synchronous access in bus-based interconnection. In the presented structure, which is based on a mesh topology and a simple routing method, we attempt to increase the bandwidth of data exchanged among different cores. An attempt has also been made to increase performance by isolating the target address transfer path from the data transfer path. The main goal of this paper is to present a new structure that improves energy consumption, area and maximum frequency in network on chip systems using information coding and decoding techniques. The presented structure is simulated using Xilinx ISE software, and the results show the effectiveness of this architecture.
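
    To make the code division multiple access idea concrete, the sketch below shows how several senders can share one medium at the same time using orthogonal spreading codes. It is a generic illustration under stated assumptions: Walsh codes, the function names, and the bit-level model are not taken from the paper, which does not specify its code family in this abstract.

```python
def walsh_codes(n):
    """Build n mutually orthogonal +/-1 Walsh codes (n must be a power of two)."""
    codes = [[1]]
    while len(codes) < n:
        codes = [c + c for c in codes] + [c + [-x for x in c] for c in codes]
    return codes

def encode(bits, codes):
    """Spread each sender's bit (0/1) with its code and sum the chips on the shared medium."""
    length = len(codes[0])
    channel = [0] * length
    for bit, code in zip(bits, codes):
        symbol = 1 if bit else -1
        for i in range(length):
            channel[i] += symbol * code[i]
    return channel

def decode(channel, code):
    """Correlate the summed chips with one code to recover that sender's bit."""
    corr = sum(c * k for c, k in zip(channel, code))
    return 1 if corr > 0 else 0
```

    Because the codes are orthogonal, correlating the summed signal with one code cancels every other sender's contribution, so all senders can transmit simultaneously without arbitration.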

  9. Complete all-optical processing polarization-based binary logic gates and optical processors.

    Science.gov (United States)

    Zaghloul, Y A; Zaghloul, A R M

    2006-10-16

    We present a complete all-optical-processing polarization-based binary-logic system, by which any logic gate or processor can be implemented. Following the new polarization-based logic presented in [Opt. Express 14, 7253 (2006)], we develop a new parallel processing technique that allows for the creation of all-optical-processing gates that produce a unique output either logic 1 or 0 only once in a truth table, and those that do not. This representation allows for the implementation of simple unforced OR, AND, XOR, XNOR, inverter, and more importantly NAND and NOR gates that can be used independently to represent any Boolean expression or function. In addition, the concept of a generalized gate is presented which opens the door for reconfigurable optical processors and programmable optical logic gates. Furthermore, the new design is completely compatible with the old one presented in [Opt. Express 14, 7253 (2006)], and with current semiconductor based devices. The gates can be cascaded, where the information is always on the laser beam. The polarization of the beam, and not its intensity, carries the information. The new methodology allows for the creation of multiple-input-multiple-output processors that implement, by itself, any Boolean function, such as specialized or non-specialized microprocessors. Three all-optical architectures are presented: orthoparallel optical logic architecture for all known and unknown binary gates, singlebranch architecture for only XOR and XNOR gates, and the railroad (RR) architecture for polarization optical processors (POP). All the control inputs are applied simultaneously leading to a single time lag which leads to a very-fast and glitch-immune POP. A simple and easy-to-follow step-by-step algorithm is provided for the POP, and design reduction methodologies are briefly discussed. The algorithm lends itself systematically to software programming and computer-assisted design. As examples, designs of all binary gates, multiple
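
    The claim that NAND gates alone can represent any Boolean expression can be checked with a few lines of ordinary Boolean logic. This sketch models only the logic, not the polarization encoding or optical hardware described in the abstract; all function names are illustrative.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 0 if (a and b) else 1

def inv(a):
    """Inverter built from one NAND."""
    return nand(a, a)

def and_(a, b):
    """AND built from two NANDs."""
    return inv(nand(a, b))

def or_(a, b):
    """OR built from three NANDs (De Morgan)."""
    return nand(inv(a), inv(b))

def xor(a, b):
    """XOR built from four NANDs."""
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))
```

    Since inverter, AND, and OR are all expressible with NAND, any truth table can be realized from NAND gates alone; the analogous construction works for NOR.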

  10. User microprogrammable processors for high data rate telemetry preprocessing

    Science.gov (United States)

    Pugsley, J. H.; Ogrady, E. P.

    1973-01-01

    The use of microprogrammable processors for the preprocessing of high data rate satellite telemetry is investigated. The following topics are discussed along with supporting studies: (1) evaluation of commercial microprogrammable minicomputers for telemetry preprocessing tasks; (2) microinstruction sets for telemetry preprocessing; and (3) the use of multiple minicomputers to achieve high data processing. The simulation of small microprogrammed processors is discussed along with examples of microprogrammed processors.

  11. Benchmarking a DSP processor

    OpenAIRE

    Lennartsson, Per; Nordlander, Lars

    2002-01-01

    This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. The algorithms were programm...

  12. Resource efficiency of hardware extensions of a 4-issue VLIW processor for elliptic curve cryptography

    Science.gov (United States)

    Jungeblut, T.; Puttmann, C.; Dreesen, R.; Porrmann, M.; Thies, M.; Rückert, U.; Kastens, U.

    2010-12-01

    The secure transmission of data plays a significant role in today's information era. Especially in the area of public-key cryptography, methods based on elliptic curves (ECC) are gaining more and more importance. Compared to other asymmetric algorithms, like RSA, ECC can be used with shorter key lengths while achieving an equal level of security. The performance of ECC algorithms can be increased significantly by adding application-specific hardware extensions. Due to their fine-grained parallelism, VLIW processors are well suited for the execution of ECC algorithms. In this work, we extended the fourfold parallel CoreVA-VLIW architecture by several hardware accelerators to increase the resource efficiency of the overall system. For the design-space exploration we use a dual design flow, which is based on the automatic generation of a complete C-compiler based tool chain from a central processor specification. Using the hardware accelerators, the performance of the scalar multiplication on binary fields can be increased by a factor of 29. The energy consumption can be reduced by up to 90%. The extended processor hardware was mapped onto a current 65 nm low-power standard-cell technology. The chip area of the CoreVA-VLIW architecture is 0.24 mm² at a power consumption of 29 mW/MHz. The performance gain is analyzed with respect to the increased hardware costs, such as chip area or power consumption.

  13. System Level Design of Reconfigurable Server Farms Using Elliptic Curve Cryptography Processor Engines

    Directory of Open Access Journals (Sweden)

    Sangook Moon

    2014-01-01

    Full Text Available As today’s hardware architecture becomes more and more complicated, it is getting harder to modify or improve the microarchitecture of a design at the register transfer level (RTL). Consequently, the traditional methods we have used to develop a design are not capable of coping with complex designs. In this paper, we suggest a way of designing complex digital logic circuits with a soft and advanced type of SystemVerilog at an electronic system level. We apply the concept of design-and-reuse with a high level of abstraction to implement elliptic curve crypto-processor server farms. With a level of abstraction superior to the RTL used in traditional HDL design, we successfully achieved the soft implementation of the crypto-processor server farms as well as robust test bench code with trivial effort in the same simulation environment. Otherwise, it could have required error-prone Verilog simulations for the hardware IPs and other time-consuming jobs such as C/SystemC verification for the software, sacrificing more time and effort. In the design of the elliptic curve cryptography processor engine, we propose a 3X faster GF(2^m) serial multiplication architecture.
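
    For readers unfamiliar with serial multiplication in a binary field GF(2^m), the textbook bit-serial (shift-and-add) algorithm is sketched below. This is the standard baseline, not the faster architecture proposed in the paper; the function name and parameters are illustrative, and the field polynomial 0x11B (the one used by AES for m = 8) is chosen only as a convenient, well-known example.

```python
def gf2m_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m), processing one bit of b per step (LSB first).

    poly is the irreducible polynomial including the x^m term, e.g. 0x11B for m = 8.
    """
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & (1 << m):
            a ^= poly            # reduce modulo the field polynomial
    return result
```

    Each loop iteration corresponds to one clock cycle of a bit-serial hardware multiplier, so a multiplication takes m cycles; architectures that process several bits per cycle trade area for fewer cycles.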

  14. Modeling and optimization of multiple unmanned aerial vehicles system architecture alternatives.

    Science.gov (United States)

    Qin, Dongliang; Li, Zhifei; Yang, Feng; Wang, Weiping; He, Lei

    2014-01-01

    Unmanned aerial vehicle (UAV) systems have already been used in civilian activities, although only to a very limited extent. Confronted with different types of tasks, multiple UAVs usually need to be coordinated. This can be abstracted as a multi-UAV system architecture problem. Based on the general system architecture problem, a specific description of the multi-UAV system architecture problem is presented. The corresponding optimization problem is then formulated, and an efficient genetic algorithm with a refined crossover operator (GA-RX) is proposed to accomplish the architecting process iteratively. The feasibility and effectiveness of the overall method are validated using 2 simulations based on 2 different scenarios.

  15. Application-Specific Instruction Set Processor Implementation of List Sphere Detector

    Directory of Open Access Journals (Sweden)

    Salmela Perttu

    2007-01-01

    Full Text Available Multiple-input multiple-output (MIMO) technology enables higher transmission capacity without additional frequency spectrum and is becoming a part of many wireless system standards. Sphere detection has been introduced in MIMO systems to achieve maximum likelihood (ML) or near-ML estimation with reduced complexity. This paper reviews related work on sphere detector implementations and presents an application-specific instruction set processor (ASIP) implementation of K-best list sphere detector (LSD) using transport triggered architecture (TTA). The implementation is based on using memory and heap data structure for symbol vector sorting. The design space is explored by presenting several variations of the implementation and comparing them with each other in terms of their latencies and hardware complexities. An early proposal for a parallelized architecture with a decoding throughput of approximately 5.3 Mbps is presented.
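
    The K-best search with heap-based sorting mentioned in this abstract can be illustrated in software. The sketch below is a simplified, real-valued model under stated assumptions: the channel has already been reduced to an upper-triangular matrix R (for example by QR decomposition), and the function name, arguments, and use of Python's heapq module are illustrative rather than a description of the paper's ASIP.

```python
import heapq

def k_best_detect(R, y, alphabet, k):
    """K-best tree search for min ||y - R s||^2 with upper-triangular R.

    Returns up to k (distance, symbol_vector) pairs, best first.
    """
    n = len(y)
    survivors = [(0.0, [])]                       # (partial distance, symbols for levels i+1..n-1)
    for i in range(n - 1, -1, -1):                # detect the last symbol first
        children = []
        for dist, tail in survivors:
            for s in alphabet:
                cand = [s] + tail                 # symbols for levels i..n-1
                interference = sum(R[i][i + j] * cand[j] for j in range(len(cand)))
                children.append((dist + (y[i] - interference) ** 2, cand))
        # heap-based selection of the k candidates with the smallest partial distance
        survivors = heapq.nsmallest(k, children, key=lambda c: c[0])
    return survivors
```

    Keeping only k survivors per tree level bounds the work per detected vector, which is what makes the K-best variant attractive for fixed-throughput hardware; the returned list of candidates is the "list" in list sphere detection.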

  16. Explore the Performance of the ARM Processor Using JPEG

    Directory of Open Access Journals (Sweden)

    A.D. Jadhav

    2010-01-01

    Full Text Available Recently, the evolution of embedded systems has shown a strong trend towards application-specific, single-chip solutions. The ARM processor core is a leading RISC processor architecture in the embedded domain. The ARM family of processors supports a unique feature of code size reduction. This paper illustrates it on an embedded platform by designing an image encoder, specifically a JPEG encoder, on an ARM7TDMI processor. A grayscale image is encoded using Keil software, and the same procedure is repeated in MATLAB to compare the results with a standard implementation. The JPEG application was successfully ported to the ARM7 processor.

  17. Description and Simulation of a Fast Packet Switch Architecture for Communication Satellites

    Science.gov (United States)

    Quintana, Jorge A.; Lizanich, Paul J.

    1995-01-01

    The NASA Lewis Research Center has been developing the architecture for a multichannel communications signal processing satellite (MCSPS) as part of a flexible, low-cost meshed-VSAT (very small aperture terminal) network. The MCSPS architecture is based on a multifrequency, time-division-multiple-access (MF-TDMA) uplink and a time-division multiplex (TDM) downlink. There are eight uplink MF-TDMA beams, and eight downlink TDM beams, with eight downlink dwells per beam. The information-switching processor, which decodes, stores, and transmits each packet of user data to the appropriate downlink dwell onboard the satellite, has been fully described by using VHSIC (Very High Speed Integrated-Circuit) Hardware Description Language (VHDL). This VHDL code, which was developed in-house to simulate the information switching processor, showed that the architecture is both feasible and viable. This paper describes a shared-memory-per-beam architecture, its VHDL implementation, and the simulation efforts.

  18. Digital optical cellular image processor (DOCIP) - Experimental implementation

    Science.gov (United States)

    Huang, K.-S.; Sawchuk, A. A.; Jenkins, B. K.; Chavel, P.; Wang, J.-M.; Weber, A. G.; Wang, C.-H.; Glaser, I.

    1993-01-01

    We demonstrate experimentally the concept of the digital optical cellular image processor architecture by implementing one processing element of a prototype optical computer that includes a 54-gate processor, an instruction decoder, and electronic input-output interfaces. The processor consists of a two-dimensional (2-D) array of 54 optical logic gates implemented by use of a liquid-crystal light valve and a 2-D array of 53 subholograms to provide interconnections between gates. The interconnection hologram is fabricated by a computer-controlled optical system.

  19. Does NASA's Constellation Architecture Offer Opportunities to Achieve Multiple Additional Goals in Space?

    Science.gov (United States)

    Thronson, Harley; Lester, Daniel F.

    2008-01-01

    Every major NASA human spaceflight program in the last four decades has been modified to achieve goals in space not incorporated within the original design goals: the Apollo Applications Program, Skylab, Space Shuttle, and International Space Station. Several groups in the US have been identifying major future science goals, the science facilities necessary to investigate them, as well as possible roles for augmented versions of elements of NASA's Constellation program. Specifically, teams in the astronomy community have been developing concepts for very capable missions to follow the James Webb Space Telescope that could take advantage of - or require - free-space operations by astronauts and/or robots. Take as one example the Single-Aperture Far-InfraRed (SAFIR) telescope, with an approximately 10+ m aperture, proposed for operation in the 2020 timeframe. According to current NASA plans, the Ares V launch vehicle (or a variant) will be available about the same time, as will the capability to transport astronauts to the vicinity of the Moon via the Orion Crew Exploration Vehicle and associated systems. [As the lunar surface offers no advantages - and major disadvantages - for most major optical systems, the expensive system for landing and operating on the lunar surface is not required.] Although as currently conceived, SAFIR and other astronomical missions will operate at the Sun-Earth L2 location, it appears trivial to travel for servicing to the more accessible Earth-Moon L1,2 locations. Moreover, as the recent Orbital Express and Automated Transfer Vehicle missions have demonstrated, future robotic systems should offer capabilities that would (remotely) extend human presence far beyond the vicinity of the Earth. In addition to multiplying the value of NASA's architecture for future human spaceflight to achieve the goals of multiple major stakeholders, if humans one day travel beyond the Earth-Moon system - say, to Mars - technologies and capabilities for operating

  20. Programming Massively Parallel Architectures using MARTE: a Case Study

    CERN Document Server

    Rodrigues, Wendell; Dekeyser, Jean-Luc

    2011-01-01

    Nowadays, several industrial applications are being ported to parallel architectures. These applications take advantage of the potential parallelism provided by multiple core processors. Many-core processors, especially GPUs (Graphics Processing Units), have led the race of floating-point performance since 2003. While the performance improvement of general-purpose microprocessors has slowed significantly, GPUs have continued to improve relentlessly. As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 to 1. However, as parallel programming requires a non-trivial distribution of tasks and data, developers find it hard to implement their applications effectively. Aiming to improve the use of many-core processors, this work presents a case study using UML and the MARTE profile to specify and generate OpenCL code for intensive signal processing applications. Benchmark results show us the viability of the use of MDE approaches to generate G...

  1. Signal processor packaging design

    Science.gov (United States)

    McCarley, Paul L.; Phipps, Mickie A.

    1993-10-01

    The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

  2. Token-Aware Completion Functions for Elastic Processor Verification

    Directory of Open Access Journals (Sweden)

    Sudarshan K. Srinivasan

    2009-01-01

    Full Text Available We develop a formal verification procedure to check that elastic pipelined processor designs correctly implement their instruction set architecture (ISA) specifications. The notion of correctness we use is based on refinement. Refinement proofs are based on refinement maps, which, in the context of this problem, are functions that map elastic processor states to states of the ISA specification model. Data flow in elastic architectures is complicated by the insertion of any number of buffers in any place in the design, making it hard to construct refinement maps for elastic systems in a systematic manner. We introduce token-aware completion functions, which incorporate a mechanism to track the flow of data in elastic pipelines, as a highly automated and systematic approach to construct refinement maps. We demonstrate the efficiency of the overall verification procedure based on token-aware completion functions using six elastic pipelined processor models based on the DLX architecture.

  3. SET: Session Layer-Assisted Efficient TCP Management Architecture for 6LoWPAN with Multiple Gateways

    Directory of Open Access Journals (Sweden)

    Akbar AliHammad

    2010-01-01

    Full Text Available 6LoWPAN (IPv6 based Low-Power Personal Area Network) is a protocol specification that facilitates communication of IPv6 packets on top of IEEE 802.15.4 so that the Internet and wireless sensor networks can be interconnected. This interconnection is especially required in commercial and enterprise applications of sensor networks where reliable and timely data transfers, such as multiple code updates, are needed from Internet nodes to sensor nodes. For this type of inbound traffic, which is mostly bulk, TCP as the transport layer protocol is essential, resulting in an end-to-end TCP session through a default gateway. In this scenario, a single gateway tends to become the bottleneck because of non-uniform connectivity to all the sensor nodes, besides being vulnerable to buffer overflow. We propose SET, a management architecture for multiple split-TCP sessions across a number of serving gateways. SET implements striping and multiple TCP session management through a shim at the session layer. Through analytical modeling and ns2 simulations, we show that our proposed architecture optimizes communication for ingress bulk data transfer while providing associated load balancing services. We conclude that multiple split-TCP sessions managed in parallel across a number of gateways result in reduced latency for bulk data transfer and provide robustness against gateway failures.
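
    The striping idea (splitting one bulk transfer across several gateways and reassembling it at the receiver) can be sketched as follows. This is a toy model only: the round-robin assignment, the chunk numbering, and the function names are assumptions for illustration, and no actual TCP sessions or 6LoWPAN framing are involved.

```python
def stripe(data, gateways, chunk_size):
    """Split data into numbered chunks and assign them round-robin to gateways."""
    plan = {g: [] for g in gateways}
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        gateway = gateways[seq % len(gateways)]
        plan[gateway].append((seq, data[start:start + chunk_size]))
    return plan

def reassemble(plan):
    """Merge chunks from all gateways back into order by sequence number."""
    chunks = sorted(c for per_gateway in plan.values() for c in per_gateway)
    return b"".join(payload for _, payload in chunks)
```

    Because every chunk carries its own sequence number, the receiver can restore the original order regardless of which gateway delivers first, and chunks assigned to a failed gateway could in principle be reassigned to another one.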

  4. Launching applications on compute and service processors running under different operating systems in scalable network of processor boards with routers

    Science.gov (United States)

    Tomkins, James L.; Camp, William J.

    2009-03-17

    A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure also permits easy physical scalability of the computing apparatus. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

  5. The Milstar Advanced Processor

    Science.gov (United States)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of such upgrading to already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  6. Automation in Architectural Photogrammetry: Line-Photogrammetry for the Reconstruction from Single and Multiple Images

    NARCIS (Netherlands)

    Van den Heuvel, F.A.

    2003-01-01

    Architectural photogrammetry has been practised for more than a century for the documentation of cultural heritage. Nowadays, the emphasis is on the construction of computer models for virtual reality applications. Since the introduction of the computer, and later the digital camera, research in pho

  7. A Multi-Objective Compounded Local Mobile Cloud Architecture Using Priority Queues to Process Multiple Jobs

    Science.gov (United States)

    Wei, Xiaohui; Sun, Bingyi; Cui, Jiaxu; Xu, Gaochao

    2016-01-01

    As a result of the greatly increased use of mobile devices, the disadvantages of portable devices have gradually begun to emerge. To solve these problems, the use of mobile cloud computing assisted by cloud data centers has been proposed. However, cloud data centers are always very far from the mobile requesters. In this paper, we propose an improved multi-objective local mobile cloud model: Compounded Local Mobile Cloud Architecture with Dynamic Priority Queues (LMCpri). This new architecture can briefly store jobs that arrive simultaneously at the cloudlet in different priority positions according to the result of auction processing, and then execute partitioning tasks on capable helpers. In the Scheduling Module, NSGA-II is employed as the scheduling algorithm to shorten processing time and decrease requester cost relative to PSO and sequential scheduling. The simulation results show that setting the number of iterations to 30 is the best choice for the system. In addition, compared with LMCque, LMCpri is able to effectively accommodate a requester who would like his job to be executed in advance and to shorten execution time. Finally, we conduct a comparative experiment between LMCpri and a cloud-assisted architecture, and the results reveal that LMCpri performs better than the cloud-assisted architecture. PMID:27419854
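
    The notion of storing simultaneously arriving jobs in different priority positions can be pictured with an ordinary priority queue. The sketch below is a hypothetical simplification: it assumes the auction yields a single numeric bid per job and that a higher bid means earlier execution, neither of which is stated in the abstract, and the class and method names are invented for illustration.

```python
import heapq
import itertools

class JobQueue:
    """Orders jobs by auction bid (highest first); ties keep arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def submit(self, job, bid):
        # heapq is a min-heap, so negate the bid to pop the highest bid first.
        heapq.heappush(self._heap, (-bid, next(self._arrival), job))

    def next_job(self):
        """Remove and return the job with the highest bid."""
        return heapq.heappop(self._heap)[2]
```

    The arrival counter breaks ties deterministically, so two jobs with equal bids are served in the order they arrived.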

  8. A Multi-Objective Compounded Local Mobile Cloud Architecture Using Priority Queues to Process Multiple Jobs.

    Directory of Open Access Journals (Sweden)

    Xiaohui Wei

    Full Text Available As a result of the greatly increased use of mobile devices, the disadvantages of portable devices have gradually begun to emerge. To solve these problems, the use of mobile cloud computing assisted by cloud data centers has been proposed. However, cloud data centers are always very far from the mobile requesters. In this paper, we propose an improved multi-objective local mobile cloud model: Compounded Local Mobile Cloud Architecture with Dynamic Priority Queues (LMCpri). This new architecture can briefly store jobs that arrive simultaneously at the cloudlet in different priority positions according to the result of auction processing, and then execute partitioning tasks on capable helpers. In the Scheduling Module, NSGA-II is employed as the scheduling algorithm to shorten processing time and decrease requester cost relative to PSO and sequential scheduling. The simulation results show that setting the number of iterations to 30 is the best choice for the system. In addition, compared with LMCque, LMCpri is able to effectively accommodate a requester who would like his job to be executed in advance and to shorten execution time. Finally, we conduct a comparative experiment between LMCpri and a cloud-assisted architecture, and the results reveal that LMCpri performs better than the cloud-assisted architecture.

  11. Efficient SIMD optimization for media processors

    Institute of Scientific and Technical Information of China (English)

    Jian-peng ZHOU; Ce SHI

    2008-01-01

    Single instruction multiple data (SIMD) instructions are often implemented in modern media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for SIMD instructions. This paper focuses on SIMD instruction generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor of up to 2.639 is obtained.

  12. Research of matrix multiplication based on CUDA architecture

    Institute of Scientific and Technical Information of China (English)

    马梦琦; 刘羽; 曾胜田

    2011-01-01

    This paper first introduces the characteristics of the CUDA architecture, implements matrix multiplication on the GPU in two ways using CUDA, and optimizes the matrix multiplication according to CUDA's particular hardware and software architecture. The GPU peak ratio is then calculated and analyzed. Experimental results show that CUDA-based matrix multiplication on the GPU achieves a high speed-up ratio compared with matrix multiplication on the CPU, with a maximum speed-up ratio of 1079.64. The floating-point capability of the GPU is used effectively, with the peak ratio reaching up to 30.85%.
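
    The two metrics reported in this record can be made concrete with a short calculation. This sketch assumes that the speed-up ratio is CPU time divided by GPU time and that the peak ratio is achieved floating-point throughput divided by the theoretical peak of the device; the record does not define either term, and the matrix size, timings, and peak rate below are made-up values, not the paper's data.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Speed-up ratio, assumed here to be CPU time over GPU time."""
    return cpu_seconds / gpu_seconds


def matmul_flops(n):
    """Approximate floating-point operations for an n x n by n x n product.

    Each of the n*n outputs needs n multiplies and about n adds.
    """
    return 2 * n ** 3


def peak_ratio(n, gpu_seconds, peak_flops_per_second):
    """Achieved throughput as a fraction of the theoretical peak (assumed definition)."""
    achieved = matmul_flops(n) / gpu_seconds
    return achieved / peak_flops_per_second


# Hypothetical run: 1024 x 1024 matrices, 10 ms on a device with a 1 TFLOP/s peak.
ratio = peak_ratio(1024, 0.01, 1e12)
gain = speedup(10.0, 0.01)
print(f"peak ratio: {ratio:.1%}, speed-up: {gain:.0f}x")
```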

  13. Interactive Digital Signal Processor

    Science.gov (United States)

    Mish, W. H.

    1985-01-01

    Interactive Digital Signal Processor, IDSP, consists of set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of digital signal time series to extract information usually achieved by applications of number of fairly standard operations. IDSP excellent teaching tool for demonstrating application for time series operators to artificially generated signals.

  14. Beyond processor sharing

    NARCIS (Netherlands)

    Aalto, S.; Ayesta, U.; Borst, S.C.; Misra, V.; Núñez Queija, R.

    2007-01-01

    While the (Egalitarian) Processor-Sharing (PS) discipline offers crucial insights in the performance of fair resource allocation mechanisms, it is inherently limited in analyzing and designing differentiated scheduling algorithms such as Weighted Fair Queueing and Weighted Round-Robin. The Discrimin

  15. Conversion via software of a simd processor into a mimd processor

    Energy Technology Data Exchange (ETDEWEB)

    Guzman, A.; Gerzso, M.; Norkin, K.B.; Vilenkin, S.Y.

    1983-01-01

    A method is described which takes a pure LISP program and automatically decomposes it via automatic parallelization into several parts, one for each processor of an SIMD architecture. Each of these parts is a different execution flow, i.e., a different program. The execution of these different programs by an SIMD architecture is examined. The method has been developed in some detail for the PS-2000, an SIMD Soviet multiprocessor, making it behave like AHR, a Mexican MIMD multi-microprocessor. Both the PS-2000 and AHR execute a pure LISP program in parallel; its decomposition into n pieces, their synchronization, scheduling, etc., are performed by the system (hardware and software). In order to achieve simultaneous execution of different programs in an SIMD processor, the method uses a scheme of node scheduling and node exportation. 14 references.

  16. The Central Trigger Processor (CTP)

    CERN Multimedia

    Franchini, Matteo

    2016-01-01

    The Central Trigger Processor (CTP) receives trigger information from the calorimeter and muon trigger processors, as well as from other sources of trigger. It makes the Level-1 decision (L1A) based on a trigger menu.

  17. Adaptive reconfigurable distributed sensor architecture

    Science.gov (United States)

    Akey, Mark L.

    1997-07-01

    The infancy of unattended ground based sensors is quickly coming to an end with the arrival of on-board GPS, networking, and multiple sensing capabilities. Unfortunately, their use is only first-order at best: GPS assists with sensor report registration; networks push sensor reports back to the warfighter and forwards control information to the sensors; multispectral sensing is a preset, pre-deployment consideration; and the scalability of large sensor networks is questionable. Current architectures provide little synergy among or within the sensors either before or after deployment, and do not map well to the tactical user's organizational structures and constraints. A new distributed sensor architecture is defined which moves well beyond single sensor, single task architectures. Advantages include: (1) automatic mapping of tactical direction to multiple sensors' tasks; (2) decentralized, distributed management of sensor resources and tasks; (3) software reconfiguration of deployed sensors; (4) network scalability and flexibility to meet the constraints of tactical deployments, and traditional combat organizations and hierarchies; and (5) adaptability to new battlefield communication paradigms such as BADD (Battlefield Analysis and Data Dissemination). The architecture is supported in two areas: a recursive, structural definition of resource configuration and management via loose associations; and a hybridization of intelligent software agents with tele-programming capabilities. The distributed sensor architecture is examined within the context of air-deployed ground sensors with acoustic, communication direction finding, and infra-red capabilities. Advantages and disadvantages of the architecture are examined. Consideration is given to extended sensor life (up to 6 months), post-deployment sensor reconfiguration, limited on-board sensor resources (processor and memory), and bandwidth. It is shown that technical tasking of the sensor suite can be automatically

  18. A Domain Specific DSP Processor

    OpenAIRE

    Tell, Eric

    2001-01-01

    This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. The second part is a nearly complete design specification. The intended use of the processor is as a platform for hardware acceleration units. Support for this has howe...

  19. Processor register error correction management

    Science.gov (United States)

    Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.

    2016-12-27

    Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.

  20. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    Energy Technology Data Exchange (ETDEWEB)

    Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

    2008-01-01

    performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 tera-operations per second (TeraOPS). IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
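
    The peak figure quoted in this record can be checked with a little arithmetic. A minimal sketch, assuming one full 256x256 matrix-vector product completes per clock cycle and that each multiply and each add is counted as a separate operation (the truncated record does not state this explicitly):

```python
MATRIX_DIM = 256    # nominal matrix size from the record
CLOCK_HZ = 125e6    # 125 MHz system clock from the record

# One matrix-vector product needs one multiply-accumulate per matrix entry.
macs_per_cycle = MATRIX_DIM * MATRIX_DIM    # 65,536

# Assumption: a multiply and an add are counted as two operations.
ops_per_cycle = 2 * macs_per_cycle          # 131,072 = 128K

peak_ops_per_second = ops_per_cycle * CLOCK_HZ
print(f"{ops_per_cycle // 1024}K ops per cycle")
print(f"{peak_ops_per_second / 1e12:.3f} TeraOPS peak")
```

    Under these assumptions the product comes to about 16.4 x 10^12 operations per second, consistent with the rounded 16 TeraOPS in the record.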

  1. Performance of Artificial Intelligence Workloads on the Intel Core 2 Duo Series Desktop Processors

    Directory of Open Access Journals (Sweden)

    Abdul Kareem PARCHUR

    2010-12-01

    As processor architectures became more advanced, Intel introduced its Intel Core 2 Duo series processors. The performance impact on Intel Core 2 Duo processors is analyzed using SPEC CPU INT 2006 performance numbers. This paper studies the behavior of Artificial Intelligence (AI) benchmarks on Intel Core 2 Duo series processors. Moreover, we estimate the task completion time (TCT) at Intel Core 2 Duo series processor frequencies of 1 GHz, 2 GHz and 3 GHz. Our results show the performance scalability of Intel Core 2 Duo series processors. Even though the AI benchmarks have similar execution times, they have dissimilar characteristics, which are identified using principal component analysis and a dendrogram. As the processor frequency increased from 1.8 GHz to 3.167 GHz, the execution time decreased by ~370 sec for AI workloads. In the case of Physics/Quantum Computing programs it was ~940 sec.

  2. Time-Predictable Computer Architecture

    Directory of Open Access Journals (Sweden)

    Schoeberl Martin

    2009-01-01

    Today's general-purpose processors are optimized for maximum throughput. Real-time systems need a processor with both a reasonable and a known worst-case execution time (WCET). Features such as pipelines with instruction dependencies, caches, branch prediction, and out-of-order execution complicate WCET analysis and lead to very conservative estimates. In this paper, we evaluate the issues of current architectures with respect to WCET analysis. Then, we propose solutions for a time-predictable computer architecture. The proposed architecture is evaluated with implementation of some features in a Java processor. The resulting processor is a good target for WCET analysis and still performs well in the average case.

  3. Cache Energy Optimization Techniques For Modern Processors

    Energy Technology Data Exchange (ETDEWEB)

    Mittal, Sparsh [ORNL

    2013-01-01

    Modern multicore processors are employing large last-level caches, for example Intel's E7-8800 processor uses 24MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence, leakage energy is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for product systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time and server systems. Also, we present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-torque transfer RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead, micro-architecture components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to save energy to ensure that cache reconfiguration does not increase energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving energy-efficiency of higher-end embedded, desktop, QoS, real-time, server processors and multitasking systems. This book is intended to be a valuable guide for both

  4. Conversion of an 8-bit to a 16-bit Soft-core RISC Processor

    Directory of Open Access Journals (Sweden)

    Ahmad Jamal Salim

    2013-03-01

    The demand for 8-bit processors is still going strong despite efforts by manufacturers to bring higher-end microcontroller solutions to the mass market. Low-end processors offer a simple, low-cost and fast solution, especially for I/O application development in embedded systems. However, due to architectural constraints, complex calculations cannot be performed efficiently on an 8-bit processor. This paper presents a method for converting an 8-bit Reduced Instruction Set Computer (RISC) processor to a 16-bit one in a soft-core reconfigurable platform in order to extend its capability in handling larger data sets, thus enabling intensive calculations. While the conversion expands the data bus width to 16 bits, it also maintains the simple architecture design of an 8-bit processor. The expansion also provides more room for improvement to the processor’s performance. The modified architecture is successfully simulated in CPUSim together with its new instruction set architecture (ISA). A Xilinx Virtex-6 platform is used to execute and verify the architecture. Results show that the modified 16-bit RISC architecture requires only 17% more register slices in the Field Programmable Gate Array (FPGA) implementation, a slight increase compared to the original 8-bit RISC architecture. A test program containing instructions that handle 16-bit data is also simulated and verified. As the 16-bit architecture is described as a soft core, further modifications could be performed in order to customize the architecture to suit any specific application.

  5. Space and frequency-multiplexed optical linear algebra processor - Fabrication and initial tests

    Science.gov (United States)

    Casasent, D.; Jackson, J.

    1986-01-01

    A new optical linear algebra processor architecture is described. Space and frequency-multiplexing are used to accommodate bipolar and complex-valued data. A fabricated laboratory version of this processor is described, the electronic support system used is discussed, and initial test data obtained on it are presented.

  6. Efficient FPGA Implementation of High-Throughput Mixed Radix Multipath Delay Commutator FFT Processor for MIMO-OFDM

    Directory of Open Access Journals (Sweden)

    DALI, M.

    2017-02-01

    This article presents and evaluates pipelined architecture designs for an improved high-frequency Fast Fourier Transform (FFT) processor implemented on Field Programmable Gate Arrays (FPGA) for Multiple Input Multiple Output Orthogonal Frequency Division Multiplexing (MIMO-OFDM). The architecture presented is a Mixed-Radix Multipath Delay Commutator. The presented parallel architecture utilizes fewer hardware resources compared to Radix-2 architecture, while maintaining simple control and butterfly structures inherent to Radix-2 implementations. The high-frequency design presented allows enhancing system throughput without requiring additional parallel data paths common in other current approaches; the presented design can process two and four independent data streams in parallel and is suitable for scaling to any power of two FFT size N. FPGA implementation of the architecture demonstrated significant resource efficiency and high throughput in comparison to relevant current approaches within the literature. The proposed architecture designs were realized with Xilinx System Generator (XSG) and evaluated on both Virtex-5 and Virtex-7 FPGA devices. Post place and route results demonstrated maximum frequency values over 400 MHz and 470 MHz for Virtex-5 and Virtex-7 FPGA devices respectively.
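
    For readers unfamiliar with the Radix-2 butterfly structure mentioned in this record, the following software sketch shows a plain recursive radix-2 decimation-in-time FFT for power-of-two sizes. It is not the paper's mixed-radix multipath delay commutator pipeline, which is a hardware design; the function names are illustrative only.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one twiddle multiplication feeds two outputs.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_naive(x):
    """Direct O(N^2) DFT, used only to cross-check the FFT."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


samples = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = fft_radix2(samples)
reference = dft_naive(samples)
max_error = max(abs(a - b) for a, b in zip(spectrum, reference))
print(f"largest deviation from the direct DFT: {max_error:.2e}")
```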

  7. Stereoscopic Optical Signal Processor

    Science.gov (United States)

    Graig, Glenn D.

    1988-01-01

    Optical signal processor produces two-dimensional cross correlation of images from stereoscopic video camera in real time. Cross correlation used to identify object, determine distance, or measure movement. Left and right cameras modulate beams from light source for correlation in video detector. Switch in position 1 produces information about range of object viewed by cameras. Position 2 gives information about movement. Position 3 helps to identify object.

  8. Multithreaded Computing Model, Architecture and Compiling Technique

    Institute of Scientific and Technical Information of China (English)

    林海波; 汤志忠

    2003-01-01

    Multithreading has been proposed as an efficient computing model for improving parallelism. It combines advantages of both dataflow architecture and von Neumann architecture, leading to high performance and efficiency. State-of-the-art multithreaded computing models include blocking threads and non-blocking threads; the corresponding multithreaded architectures can be classified as Multiple Context Processor and Hybrid Architecture. Thread partitioning is one of the most important compiling issues in multithreaded computing. The idea of multithreading will be developed further with advances in architecture, compiling technique, and operating systems.

  9. Cation- and anion-exchanges induce multiple distinct rearrangements within metallosupramolecular architectures.

    Science.gov (United States)

    Riddell, Imogen A; Ronson, Tanya K; Clegg, Jack K; Wood, Christopher S; Bilbeisi, Rana A; Nitschke, Jonathan R

    2014-07-01

    Different anionic templates act to give rise to four distinct Cd(II)-based architectures: a Cd2L3 helicate, a Cd8L12 distorted cuboid, a Cd10L15 pentagonal prism, and a Cd12L18 hexagonal prism, which respond to both anionic and cationic components. Interconversions between architectures are driven by the addition of anions that bind more strongly within a given product framework. The addition of Fe(II) prompted metal exchange and transformation to a Fe4L6 tetrahedron or a Fe10L15 pentagonal prism, depending on the anionic templates present. The equilibrium between the Cd12L18 prism and the Cd2L3 triple helicate displayed concentration dependence, with higher concentrations favoring the prism. The Cd12L18 structure serves as an intermediate en route to a hexafluoroarsenate-templated Cd10L15 complex, whereby the structural features of the hexagonal prism preorganize the system to form the structurally related pentagonal prism. In addition to the interconversion pathways investigated, we also report the single-crystal X-ray structure of bifluoride encapsulated within a Cd10L15 complex and report solution state data for J-coupling through a CH···F(-) hydrogen bond indicating the strength of these interactions in solution.

  10. Evolution of the Florida Launch Site Architecture: Embracing Multiple Customers, Enhancing Launch Opportunities

    Science.gov (United States)

    Colloredo, Scott; Gray, James A.

    2011-01-01

    The impending conclusion of the Space Shuttle Program and the Constellation Program cancellation unveiled in the FY2011 President's budget created a large void for human spaceflight capability and specifically launch activity from the Florida Launch Site (FLS). This void created an opportunity to re-architect the launch site to be more accommodating to the future NASA heavy lift and commercial space industry. The goal is to evolve the heritage capabilities into a more affordable and flexible launch complex. This case study will discuss the FLS architecture evolution from the trade studies to select primary launch site locations for future customers, to improving infrastructure; promoting environmental remediation/compliance; improving offline processing, manufacturing, & recovery; developing range interface and control services with the US Air Force, and developing modernization efforts for the Launch Pad, Vehicle Assembly Building, Mobile Launcher, and supporting infrastructure. The architecture studies will steer how to best invest limited modernization funding from initiatives like the 21st CLSC and other potential funding.

  11. Adaptive Optics Simulation for the World's Largest Telescope on Multicore Architectures with Multiple GPUs

    KAUST Repository

    Ltaief, Hatem

    2016-06-02

    We present a high performance comprehensive implementation of a multi-object adaptive optics (MOAO) simulation on multicore architectures with hardware accelerators in the context of computational astronomy. This implementation will be used as an operational testbed for simulating the design of new instruments for the European Extremely Large Telescope project (E-ELT), the world's biggest eye and one of Europe's highest priorities in ground-based astronomy. The simulation corresponds to a multi-step multi-stage procedure, which is fed, near real-time, by system and turbulence data coming from the telescope environment. Based on the PLASMA library powered by the OmpSs dynamic runtime system, our implementation relies on a task-based programming model to permit an asynchronous out-of-order execution. Using modern multicore architectures associated with the enormous computing power of GPUs, the resulting data-driven compute-intensive simulation of the entire MOAO application, composed of the tomographic reconstructor and the observing sequence, is capable of coping with the aforementioned real-time challenge and stands as a reference implementation for the computational astronomy community.

  12. Genetic architecture of carbon isotope composition and growth in Eucalyptus across multiple environments.

    Science.gov (United States)

    Bartholomé, Jérôme; Mabiala, André; Savelli, Bruno; Bert, Didier; Brendel, Oliver; Plomion, Christophe; Gion, Jean-Marc

    2015-06-01

    In the context of climate change, the water-use efficiency (WUE) of highly productive tree varieties, such as eucalypts, has become a major issue for breeding programmes. This study set out to dissect the genetic architecture of carbon isotope composition (δ(13) C), a proxy of WUE, across several environments. A family of Eucalyptus urophylla × E. grandis was planted in three trials and phenotyped for δ(13) C and growth traits. High-resolution genetic maps enabled us to target genomic regions underlying δ(13) C quantitative trait loci (QTLs) on the E. grandis genome. Of the 15 QTLs identified for δ(13) C, nine were stable across the environments and three displayed significant QTL-by-environment interaction, suggesting medium to high genetic determinism for this trait. Only one colocalization was found between growth and δ(13) C. Gene ontology (GO) term enrichment analysis suggested candidate genes related to foliar δ(13) C, including two involved in the regulation of stomatal movements. This study provides the first report of the genetic architecture of δ(13) C and its relation to growth in Eucalyptus. The low correlations found between the two traits at phenotypic and genetic levels suggest the possibility of improving the WUE of Eucalyptus varieties without having an impact on breeding for growth.

  13. A parallel unbalanced digitization architecture to reduce the dynamic range of multiple signals

    Science.gov (United States)

    Vallérian, Mathieu; Huţu, Florin; Villemaud, Guillaume; Miscopein, Benoît; Risset, Tanguy

    2016-05-01

    Technologies employed in urban sensor networks are permanently evolving, and thus the gateways employed to collect data in such kind of networks have to be very flexible in order to be compliant with the new communication standards. A convenient way to do that is to digitize all the received signals in one shot and then to digitally perform the signal processing, as it is done in software-defined radio (SDR). All signals can be emitted with very different features (bandwidth, modulation type, and power level) in order to respond to the various propagation conditions. Their difference in terms of power levels is a problem when digitizing them together, as no current commercial analog-to-digital converter (ADC) can provide a fine enough resolution to digitize this high dynamic range between the weakest possible signal in the presence of a stronger signal. This paper presents an RF front end receiver architecture capable of handling this problem by using two ADCs of lower resolutions. The architecture is validated through a set of simulations using Keysight's ADS software. The main validation criterion is the bit error rate comparison with a classical receiver.
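
    The resolution problem described in this record can be put into numbers with the standard rule of thumb for an ideal N-bit converter driven by a full-scale sine wave, SNR ≈ 6.02 N + 1.76 dB. The sketch below is an assumption-laden illustration, not the paper's method: it does not model the proposed two-ADC front end, and the 60 dB power gap and 12 dB minimum SNR are made-up values.

```python
import math


def ideal_adc_snr_db(bits):
    """Quantization-limited SNR of an ideal ADC for a full-scale sine input."""
    return 6.02 * bits + 1.76


def bits_needed(power_gap_db, min_snr_db):
    """Smallest resolution that still leaves `min_snr_db` for the weak signal.

    Assumes the strong signal sets full scale and the weak signal sits
    `power_gap_db` below it.
    """
    required_db = power_gap_db + min_snr_db
    return math.ceil((required_db - 1.76) / 6.02)


# Hypothetical case: signals 60 dB apart, weak signal needs 12 dB of SNR.
resolution = bits_needed(60, 12)
print(f"single ADC resolution needed: {resolution} bits")
print(f"ideal SNR at that resolution: {ideal_adc_snr_db(resolution):.2f} dB")
```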

  14. Concurrent Smalltalk on the Message-Driven Processor

    Science.gov (United States)

    1991-09-01

    NDPSim -x 2 -y 2 - maize Ox1O00 ::Coamoa:Coamoa.m NewFact.mdp Message-Driven Processor Simulator Version 7.0 Rev B Accompanies MDP Architecture Document...it could peel invocations of recursive functions forever. However, the single pass of inlining does not mean that functions are only inlined one

  15. Processor Management in the Tera MTA Computer System

    Science.gov (United States)

    1993-01-01

    This paper describes the processor scheduling issues specific to the Tera MTA (Multi Threaded Architecture) computer system and presents solutions to...classic scheduling problems. The Tera MTA exploits parallelism at all levels, from fine-grained instruction-level parallelism within a single

  16. Facilitating the comparison of multiple visual items on screen: the example of electronic architectural plan correction.

    Science.gov (United States)

    Fleury, Sylvain; Jamet, Éric

    2014-05-01

    This paper describes two experiments designed to (1) ascertain whether the way in which architectural plans are displayed on a computer screen influences the quality of their correction by humans, and (2) identify the visual exploration strategies adopted in this type of task. Results of the first "spot the difference" experiment showed that superimposing the plans yielded better error correction performances than displaying them side by side. Furthermore, a sequential display mode, where the second plan only gradually appeared on the screen, improved error search effectiveness. In the second experiment, eye movement recordings revealed that superimposition increased plan comparison efficiency by making it easier to establish coreference between the two sources of information. The improvement in effectiveness in the sequential condition was shown to be linked to the attentional guidance afforded by this display mode, which helped users to make a more thorough exploration of the plans.

  17. ARCHITECTURE AND DYNAMICS OF KEPLER'S CANDIDATE MULTIPLE TRANSITING PLANET SYSTEMS

    Energy Technology Data Exchange (ETDEWEB)

    Lissauer, Jack J.; Jenkins, Jon M.; Borucki, William J.; Bryson, Stephen T.; Howell, Steve B. [NASA Ames Research Center, Moffett Field, CA 94035 (United States); Ragozzine, Darin; Holman, Matthew J.; Carter, Joshua A. [Harvard-Smithsonian Center for Astrophysics, Cambridge, MA 02138 (United States); Fabrycky, Daniel C.; Fortney, Jonathan J. [Department of Astronomy and Astrophysics, University of California, Santa Cruz, CA 95064 (United States); Steffen, Jason H. [Fermilab Center for Particle Astrophysics, Batavia, IL 60510 (United States); Ford, Eric B. [211 Bryant Space Science Center, University of Florida, Gainesville, FL 32611 (United States); Shporer, Avi [Las Cumbres Observatory Global Telescope Network, Santa Barbara, CA 93117 (United States); Rowe, Jason F.; Quintana, Elisa V.; Caldwell, Douglas A. [SETI Institute/NASA Ames Research Center, Moffett Field, CA 94035 (United States); Batalha, Natalie M. [Department of Physics and Astronomy, San Jose State University, San Jose, CA 95192 (United States); Ciardi, David [Exoplanet Science Institute/Caltech, Pasadena, CA 91125 (United States); Dunham, Edward W. [Lowell Observatory, Flagstaff, AZ 86001 (United States); Gautier, Thomas N. III, E-mail: Jack.Lissauer@nasa.gov [Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109 (United States); and others

    2011-11-01

    About one-third of the ~1200 transiting planet candidates detected in the first four months of Kepler data are members of multiple candidate systems. There are 115 target stars with two candidate transiting planets, 45 with three, 8 with four, and 1 each with five and six. We characterize the dynamical properties of these candidate multi-planet systems. The distribution of observed period ratios shows that the vast majority of candidate pairs are neither in nor near low-order mean-motion resonances. Nonetheless, there are small but statistically significant excesses of candidate pairs both in resonance and spaced slightly too far apart to be in resonance, particularly near the 2:1 resonance. We find that virtually all candidate systems are stable, as tested by numerical integrations that assume a nominal mass-radius relationship. Several considerations strongly suggest that the vast majority of these multi-candidate systems are true planetary systems. Using the observed multiplicity frequencies, we find that a single population of planetary systems that matches the higher multiplicities underpredicts the number of singly transiting systems. We provide constraints on the true multiplicity and mutual inclination distribution of the multi-candidate systems, revealing a population of systems with multiple super-Earth-size and Neptune-size planets with low to moderate mutual inclinations.

  18. Evaluation of the Intel Sandy Bridge-EP server processor

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A; CERN. Geneva. IT Department

    2012-01-01

    In this paper we report on a set of benchmark results recently obtained by CERN openlab when comparing an 8-core “Sandy Bridge-EP” processor with Intel’s previous microarchitecture, the “Westmere-EP”. The Intel marketing names for these processors are “Xeon E5-2600 processor series” and “Xeon 5600 processor series”, respectively. Both processors are produced in a 32nm process, and both platforms are dual-socket servers. Multiple benchmarks were used to get a good understanding of the performance of the new processor. We used both industry-standard benchmarks, such as SPEC2006, and specific High Energy Physics benchmarks, representing both simulation of physics detectors and data analysis of physics events. Before summarizing the results we must stress the fact that benchmarking of modern processors is a very complex affair. One has to control (at least) the following features: processor frequency, overclocking via Turbo mode, the number of physical cores in use, the use of logical cores ...

  19. Complex matrix multiplication operations with data pre-conditioning in a high performance computing architecture

    Science.gov (United States)

    Eichenberger, Alexandre E; Gschwind, Michael K; Gunnels, John A

    2014-02-11

    Mechanisms for performing a complex matrix multiplication operation are provided. A vector load operation is performed to load a first vector operand of the complex matrix multiplication operation to a first target vector register. The first vector operand comprises a real and imaginary part of a first complex vector value. A complex load and splat operation is performed to load a second complex vector value of a second vector operand and replicate the second complex vector value within a second target vector register. The second complex vector value has a real and imaginary part. A cross multiply add operation is performed on elements of the first target vector register and elements of the second target vector register to generate a partial product of the complex matrix multiplication operation. The partial product is accumulated with other partial products and a resulting accumulated partial product is stored in a result vector register.
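
    The splat-then-cross-multiply-add step described above can be illustrated in scalar code. The following is only a minimal sketch under stated assumptions: the function name, the list-of-pairs layout, and the lane-by-lane loop are hypothetical stand-ins for vector registers and are not taken from the record itself.

```python
def cross_multiply_add(first_vec, second_value, acc):
    """Accumulate first_vec[i] * second_value into acc[i], lane by lane.

    first_vec    -- list of (real, imag) pairs, standing in for the first
                    target vector register (hypothetical layout)
    second_value -- one (real, imag) pair that is replicated ("splat")
                    across every lane of the second register
    acc          -- list of (real, imag) partial products so far
    """
    # Load-and-splat: replicate the single complex value in every lane.
    splat = [second_value] * len(first_vec)
    result = []
    for (a_re, a_im), (b_re, b_im), (c_re, c_im) in zip(first_vec, splat, acc):
        # Complex multiply: (a_re + i a_im) * (b_re + i b_im)
        real = a_re * b_re - a_im * b_im
        imag = a_re * b_im + a_im * b_re
        # Accumulate with the earlier partial products.
        result.append((c_re + real, c_im + imag))
    return result


# Two lanes, one splatted multiplier, zero-initialised accumulator.
partial = cross_multiply_add([(1, 2), (3, 4)], (5, 6), [(0, 0), (0, 0)])
```

    Calling the function repeatedly with the previous result as `acc` mimics accumulating partial products across one row-column pass of the matrix product.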

  20. Multiple across-strain and within-strain QTLs suggest highly complex genetic architecture for hypoxia tolerance in channel catfish.

    Science.gov (United States)

    Wang, Xiaozhu; Liu, Shikai; Jiang, Chen; Geng, Xin; Zhou, Tao; Li, Ning; Bao, Lisui; Li, Yun; Yao, Jun; Yang, Yujia; Zhong, Xiaoxiao; Jin, Yulin; Dunham, Rex; Liu, Zhanjiang

    2017-02-01

    The ability to survive hypoxic conditions is important for various organisms, especially for aquatic animals. Teleost fish, representing more than 50 % of vertebrate species, are extremely efficient in utilizing low levels of dissolved oxygen in water. However, huge variations exist among various taxa of fish in their ability to tolerate hypoxia. In aquaculture, hypoxia tolerance is among the most important traits because hypoxia can cause major economic losses. Genetic enhancement for hypoxia tolerance in catfish is of great interest, but little was done with analysis of the genetic architecture of hypoxia tolerance. The objective of this study was to conduct a genome-wide association study to identify QTLs for hypoxia tolerance using the catfish 250K SNP array with channel catfish families from six strains. Multiple significant and suggestive QTLs were identified across and within strains. One significant QTL and four suggestive QTLs were identified across strains. Six significant QTLs and many suggestive QTLs were identified within strains. There were rare overlaps among the QTLs identified within the six strains, suggesting a complex genetic architecture of hypoxia tolerance. Overall, within-strain QTLs explained larger proportion of phenotypic variation than across-strain QTLs. Many of genes within these identified QTLs have known functions for regulation of oxygen metabolism and involvement in hypoxia responses. Pathway analysis indicated that most of these genes were involved in MAPK or PI3K/AKT/mTOR signaling pathways that were known to be important for hypoxia-mediated angiogenesis, cell proliferation, apoptosis and survival.

  1. Multi-Constraint multi-processor Resource Allocation

    NARCIS (Netherlands)

    Behrouzian, A.R.B.; Goswami, D.; Basten, T.; Geilen, M.; Ara, H.A.

    2015-01-01

    This work proposes a Multi-Constraint Resource Allocation (MuCoRA) method for applications from multiple domains onto multi-processors. In particular, we address a mapping problem for multiple throughput-constrained streaming applications and multiple latency-constrained feedback control application

  2. Software-Reconfigurable Processors for Spacecraft

    Science.gov (United States)

    Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

    2005-01-01

    A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, error correction, coding, and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

  3. Unravelling the multiple functions of the architecturally intricate Streptococcus pneumoniae β-galactosidase, BgaA.

    Directory of Open Access Journals (Sweden)

    Anirudh K Singh

    2014-09-01

    Bacterial cell-surface proteins play integral roles in host-pathogen interactions. These proteins are often architecturally and functionally sophisticated and yet few studies of such proteins involved in host-pathogen interactions have defined the domains or modules required for specific functions. Streptococcus pneumoniae (pneumococcus), an opportunistic pathogen that is a leading cause of community acquired pneumonia, otitis media and bacteremia, is decorated with many complex surface proteins. These include β-galactosidase BgaA, which is specific for terminal galactose residues β-1-4 linked to glucose or N-acetylglucosamine and known to play a role in pneumococcal growth, resistance to opsonophagocytic killing, and adherence. This study defines the domains and modules of BgaA that are required for these distinct contributions to pneumococcal pathogenesis. Inhibitors of β-galactosidase activity reduced pneumococcal growth and increased opsonophagocytic killing in a BgaA dependent manner, indicating these functions require BgaA enzymatic activity. In contrast, inhibitors increased pneumococcal adherence suggesting that BgaA bound a substrate of the enzyme through a distinct module or domain. Extensive biochemical, structural and cell based studies revealed two newly identified non-enzymatic carbohydrate-binding modules (CBMs) mediate adherence to the host cell surface displayed lactose or N-acetyllactosamine. This finding is important to pneumococcal biology as it is the first adhesin-carbohydrate receptor pair identified, supporting the widely held belief that initial pneumococcal attachment is to a glycoconjugate. Perhaps more importantly, this is the first demonstration that a CBM within a carbohydrate-active enzyme can mediate adherence to host cells and thus this study identifies a new class of carbohydrate-binding adhesins and extends the paradigm of CBM function. As other bacterial species express surface-associated carbohydrate

  4. Globe hosts launch of new processor

    CERN Multimedia

    2006-01-01

    Launch of the quad-core processor chip at the Globe. On 14 November, in a series of major media events around the world, the chip-maker Intel launched its new 'quadcore' processor. For the regions of Europe, the Middle East and Africa, the day-long launch event took place in CERN's Globe of Science and Innovation, with over 30 journalists in attendance, coming from as far away as Johannesburg and Dubai. CERN was a significant choice for the event: the first tests of this new generation of processor in Europe had been made at CERN over the preceding months, as part of CERN openlab, a research partnership with leading IT companies such as Intel, HP and Oracle. The event also provided the opportunity for the journalists to visit ATLAS and the CERN Computer Centre. The strategy of putting multiple processor cores on the same chip, which has been pursued by Intel and other chip-makers in the last few years, represents an important departure from the more traditional improvements in the sheer speed of such chips. ...

  5. Design of an Elliptic Curve Cryptography Processor for RFID Tag Chips

    Directory of Open Access Journals (Sweden)

    Zilong Liu

    2014-09-01

    Radio Frequency Identification (RFID) is an important technique for wireless sensor networks and the Internet of Things. Recently, considerable research has been performed in the combination of public key cryptography and RFID. In this paper, an efficient architecture of Elliptic Curve Cryptography (ECC) Processor for RFID tag chip is presented. We adopt a new inversion algorithm which requires fewer registers to store variables than the traditional schemes. A new method for coordinate swapping is proposed, which can reduce the complexity of the controller and shorten the time of iterative calculation effectively. A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files. Clock gating and asynchronous counter are exploited to reduce the power consumption. The simulation and synthesis results show that the time needed for one elliptic curve scalar point multiplication over GF(2^163) is 176.7 K clock cycles and the gate area is 13.8 K with UMC 0.13 μm Complementary Metal Oxide Semiconductor (CMOS) technology. Moreover, the low power and low cost consumption make the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.

  6. Design of an Elliptic Curve Cryptography processor for RFID tag chips.

    Science.gov (United States)

    Liu, Zilong; Liu, Dongsheng; Zou, Xuecheng; Lin, Hui; Cheng, Jian

    2014-09-26

    Radio Frequency Identification (RFID) is an important technique for wireless sensor networks and the Internet of Things. Recently, considerable research has been performed in the combination of public key cryptography and RFID. In this paper, an efficient architecture of Elliptic Curve Cryptography (ECC) Processor for RFID tag chip is presented. We adopt a new inversion algorithm which requires fewer registers to store variables than the traditional schemes. A new method for coordinate swapping is proposed, which can reduce the complexity of the controller and shorten the time of iterative calculation effectively. A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files. Clock gating and asynchronous counter are exploited to reduce the power consumption. The simulation and synthesis results show that the time needed for one elliptic curve scalar point multiplication over GF(2^163) is 176.7 K clock cycles and the gate area is 13.8 K with UMC 0.13 μm Complementary Metal Oxide Semiconductor (CMOS) technology. Moreover, the low power and low cost consumption make the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.

  7. A Systolic Array RLS Processor

    OpenAIRE

    Asai, T.; Matsumoto, T.

    2000-01-01

    This paper presents the outline of the systolic array recursive least-squares (RLS) processor prototyped primarily with the aim of broadband mobile communication applications. To execute the RLS algorithm effectively, this processor uses an orthogonal triangularization technique known in matrix algebra as QR decomposition for parallel pipelined processing. The processor board comprises 19 application-specific integrated circuit chips, each with approximately one million gates. Thirty-two bit ...

  8. AMD's 64-bit Opteron processor

    CERN Document Server

    CERN. Geneva

    2003-01-01

    This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included. Biographies: David Rich. David directs AMD's efforts in high performance computing and also in the use of Opteron processors...

  9. Emerging Trends in Embedded Processors

    Directory of Open Access Journals (Sweden)

    Gurvinder Singh

    2014-05-01

    An embedded processor is simply a microprocessor that has been "embedded" into a device. Embedded systems are an important part of human life. For illustration, one cannot visualize life without mobile phones for personal communication. Embedded systems are used in many places like healthcare, automotive, daily life, and in different offices and industries. Embedded processors open a new research area in the field of hardware design.

  10. Multi-Softcore Architecture on FPGA

    Directory of Open Access Journals (Sweden)

    Mouna Baklouti

    2014-01-01

    To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).

  11. Communication Efficient Multi-processor FFT

    Science.gov (United States)

    Lennart Johnsson, S.; Jacquemin, Michel; Krawitz, Robert L.

    1992-10-01

    Computing the fast Fourier transform on a distributed memory architecture by a direct pipelined radix-2, a bi-section, or a multisection algorithm, all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation, normal input order, a decimation-in-time FFT requires that P/N + d - 2 twiddle factors be stored for P elements distributed evenly over N processors, and the axis that is subject to transformation be distributed over 2^d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflops/s in 32-bit precision, and 10.7 Gflops/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflops/s and 8.5 Gflops/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflops/s in 32-bit precision and 3 Gflops/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at a rate of about 1/2 to 2/3 of the unordered transforms.

  12. Transitions in Land Use Architecture under Multiple Human Driving Forces in a Semi-Arid Zone

    Directory of Open Access Journals (Sweden)

    Issa Ouedraogo

    2015-07-01

    The present study aimed to detect the main shifts in land-use architecture and assess the factors behind the changes in typical tropical semi-arid land in Burkina Faso. Three sets of time-series LANDSAT data over a 23-year period were used to detect land use changes and their underpinning drivers in multifunctional but vulnerable ecologies. Group discussions in selected villages were organized for mapping output interpretation and collection of essential drivers of change as perceived by local populations. Results revealed profound changes and transitions during the study period. During the last decade, shrub and wood savannahs exhibited high net changes (39% and −37%, respectively) with a weak net positive change for cropland (only 2%), while cropland and shrub savannah exhibited high swap (8% and 16%). This suggests that the area of cropland remained almost unchanged but was subject to relocation, wood savannah decreased drastically, and shrub savannah increased exponentially. Cropland exhibited a null net persistence while shrub and wood savannahs exhibited positive and negative net persistence (1.91 and −10.24, respectively), indicating that there is movement toward agricultural intensification and wood savannah tended to disappear to the benefit of shrub savannah. Local people are aware of the changes that have occurred and support the idea that illegal wood cutting and farming are inappropriate farming practices associated with immigration; absence of alternative cash generation sources, overgrazing and increasing demand for wood energy are driving the changes in their ecosystems. Policies that integrate restoration and conservation of natural ecosystems and promote sustainable agroforestry practices in the study zone are highly recommended.

  13. Coordinated Energy Management in Heterogeneous Processors

    Directory of Open Access Journals (Sweden)

    Indrani Paul

    2014-01-01

    This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED²) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
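
    The energy-delay squared metric named above is conventionally the product of energy and the square of execution time, so it weights slowdowns more heavily than energy savings. The short sketch below shows how such a comparison could be computed; the function names and the sample numbers are illustrative assumptions and are not measurements from the record.

```python
def ed2_product(energy_joules, delay_seconds):
    """Energy-delay squared: energy multiplied by delay squared (lower is better)."""
    return energy_joules * delay_seconds ** 2


def relative_improvement(baseline, candidate):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - candidate) / baseline


# Hypothetical numbers: a scheme that saves 28% energy at a 2% slowdown.
baseline_ed2 = ed2_product(100.0, 1.00)
managed_ed2 = ed2_product(72.0, 1.02)
gain = relative_improvement(baseline_ed2, managed_ed2)
```

    With these made-up inputs the gain comes out at roughly 25%, which shows how a small performance loss eats into a larger energy saving once delay is squared.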

  14. Efficiency of Cache Mechanism for Network Processors

    Institute of Scientific and Technical Information of China (English)

    XU Bo; CHANG Jian; HUANG Shimeng; XUE Yibo; LI Jun

    2009-01-01

    With the explosion of network bandwidth and the ever-changing requirements for diverse network-based applications, the traditional processing architectures, i.e., general purpose processor (GPP) and application specific integrated circuits (ASIC), cannot provide sufficient flexibility and high performance at the same time. Thus, the network processor (NP) has emerged as an alternative to meet these dual demands for today's network processing. The NP combines embedded multi-threaded cores with a rich memory hierarchy that can adapt to different networking circumstances when customized by the application developers. In today's NP architectures, multithreading prevails over cache mechanism, which has achieved great success in GPP to hide memory access latencies. This paper focuses on the efficiency of the cache mechanism in an NP. Theoretical timing models of packet processing are established for evaluating cache efficiency and experiments are performed based on real-life network backbone traces. Testing results show that an improvement of nearly 70% can be gained in throughput with assistance from the cache mechanism. Accordingly, the cache mechanism is still efficient and irreplaceable in network processing, despite the existence of multithreading.

  15. Reconfigurable signal processor designs for advanced digital array radar systems

    Science.gov (United States)

    Suarez, Hernan; Zhang, Yan (Rockee); Yu, Xining

    2017-05-01

    The new challenges originating from Digital Array Radar (DAR) demand a new generation of reconfigurable backend processors in the system. The new FPGA devices can support much higher speed, more bandwidth and processing capabilities for the need of digital Line Replaceable Unit (LRU). This study focuses on using the latest Altera and Xilinx devices in an adaptive beamforming processor. The field reprogrammable RF devices from Analog Devices are used as analog front end transceivers. Different from other existing Software-Defined Radio transceivers on the market, this processor is designed for distributed adaptive beamforming in a networked environment. The following aspects of the novel radar processor will be presented: (1) A new system-on-chip architecture based on Altera's devices and adaptive processing module, especially for the adaptive beamforming and pulse compression, will be introduced. (2) Successful implementation of generation 2 serial RapidIO data links on FPGA, which supports VITA-49 radio packet format for large distributed DAR processing. (3) Demonstration of the feasibility and capabilities of the processor in a Micro-TCA based, SRIO switching backplane to support multichannel beamforming in real-time. (4) Application of this processor in ongoing radar system development projects, including OU's dual-polarized digital array radar, the planned new cylindrical array radars, and future airborne radars.

  16. A 16-Bit Fully Functional Single Cycle Processor

    Directory of Open Access Journals (Sweden)

    Nidhi Maheshwari

    2011-08-01

    The existing commercial microprocessors are provided as black box units, with which users are unable to monitor internal signals and operation process, neither can they modify the original structure. In order to solve this problem, a 16-bit fully functional single cycle processor is designed in terms of its architecture and its functional capabilities. The procedure of design and verification for a 16-bit processor is introduced in this paper. The key architecture elements are described, as well as the hardware block diagram and internal structure. The summary of the instruction set is presented. This processor is modeled in Very High Speed Integrated Circuit Hardware Description Language (VHDL) and gives access to every internal signal. In order to consume fewer resources, the design of the arithmetic logical unit (ALU) is optimized. The RTL views and verified simulation results of the processor are shown in this paper. The synthesis report of the design is also described. The design architecture is written in Very High Speed Integrated Circuit Hardware Description Language (VHDL) code using Xilinx ISE 9.2i tool for synthesis and simulation.

  17. Network-Attached Solid-State Recorder Architecture

    Science.gov (United States)

    Cox, Brian

    2008-01-01

    A document discusses placing memory modules on the high-speed serial interconnect, which is used by a spacecraft's computer elements for inter-processor communications, to allow all multiple computer system architectures to access the spacecraft data storage at the same time. Each memory board is identical electrically and receives its bus ID upon connection to the system. The computer elements are configured in a similar fashion. The architecture allows for multiple memory boards to be accessed simultaneously by different computer elements, and results in a scalable, strong, fault-tolerant system. The IEEE-1393 ring bus can be routed so that multiple card failures can occur and the mass memory storage will still function.

  18. Parallel Processor for 3D Recovery from Optical Flow

    Directory of Open Access Journals (Sweden)

    Jose Hugo Barron-Zambrano

    2009-01-01

    3D recovery from motion has received a major effort in computer vision systems in the recent years. The main problem lies in the number of operations and memory accesses to be performed by the majority of the existing techniques when translated to hardware or software implementations. This paper proposes a parallel processor for 3D recovery from optical flow. Its main feature is the maximum reuse of data and the low number of clock cycles to calculate the optical flow, along with the precision with which 3D recovery is achieved. The results of the proposed architecture as well as those from processor synthesis are presented.

  19. Avionics Architecture for Exploration Project

    Data.gov (United States)

    National Aeronautics and Space Administration — The Avionics Architectures for Exploration Project team will develop a system level environment and architecture that will accommodate equipment from multiple...

  20. Multiple microscopic approaches demonstrate linkage between chromoplast architecture and carotenoid composition in diverse Capsicum annuum fruit.

    Science.gov (United States)

    Kilcrease, James; Collins, Aaron M; Richins, Richard D; Timlin, Jerilyn A; O'Connell, Mary A

    2013-12-01

    Increased accumulation of specific carotenoids in plastids through plant breeding or genetic engineering requires an understanding of the limitations that storage sites for these compounds may impose on that accumulation. Here, using Capsicum annuum L. fruit, we demonstrate directly the unique sub-organellar accumulation sites of specific carotenoids using live cell hyperspectral confocal Raman microscopy. Further, we show that chromoplasts from specific cultivars vary in shape and size, and these structural variations are associated with carotenoid compositional differences. Live-cell imaging utilizing laser scanning confocal (LSCM) and confocal Raman microscopy, as well as fixed tissue imaging by scanning and transmission electron microscopy (SEM and TEM), all demonstrated morphological differences with high concordance for the measurements across the multiple imaging modalities. These results reveal additional opportunities for genetic controls on fruit color and carotenoid-based phenotypes.

  1. Radio Astronomy Data Model for Single-Dish Multiple-Feed Telescopes, and Robledo Archive Architecture

    CERN Document Server

    Santander-Vela, J D; Gómez, J F; Verdes-Montenegro, L; Leon, S; Gutíerrez, R; Rodrigo, C; Morata, O; Solano, E; Suárez, O

    2008-01-01

    All the effort that the astrophysical community has put into the development of the Virtual Observatory (VO) has surpassed the non-return point: the VO is a reality today, and an initiative that will self-sustain, and to which all archival projects must adhere. We have started the design of the scientific archive for the DSS-63 70-m antenna at NASA's DSN station in Robledo de Chavela (Madrid). Here we show how we can use all VO proposed data models to build a VO-compliant single-dish, multiple-feed, radio astronomical archive data model (RADAMS) suitable for the archival needs of the antenna. We also propose an exhaustive list of Universal Content Descriptors (UCDs) and FITS keywords for all relevant metadata. We will further refine this data model with the experience that we will gain from that implementation.

  2. Scientific Computing Kernels on the Cell Processor

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel W.; Shalf, John; Oliker, Leonid; Kamil, Shoaib; Husbands, Parry; Yelick, Katherine

    2007-04-04

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the recently-released STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on a 3.2GHz Cell blade. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

  3. Simultaneous multithreaded processor enhanced for multimedia applications

    Science.gov (United States)

    Mombers, Friederich; Thomas, Michel

    1999-12-01

    The paper proposes a new media processor architecture specifically designed to handle state-of-the-art multimedia encoding and decoding tasks. To achieve this, the architecture efficiently exploits Data-, Instruction- and Thread-Level parallelism while continuously adapting its computational resources to reach the most appropriate parallelism level among all the concurrent encoding/decoding processes. Looking at the implementation constraints, several critical choices were adopted that solve the interconnection delay problem, lower the effects of cache misses and pipeline stalls, and reduce register file and memory size by adopting a clustered Simultaneous Multithreaded Architecture. We enhanced the classic model to exploit both Instruction and Data Level Parallelism through vector instructions. The vector extension is well justified for multimedia workloads and improves code density, crossbar complexity, register file ports and decoding logic area while it still provides an efficient way to fully exploit a large set of functional units. An MPEG-2 encoding algorithm based on Hybrid Genetic search has been implemented that shows the efficiency of the architecture in adapting its resource allocation to better fulfill the application requirements.

  4. Template Matching on Parallel Architectures,

    Science.gov (United States)

    1985-07-01

    memory. The processors run asynchronously. Thus according to Flynn's categories the Butterfly is a MIMD machine. The processors of the Butterfly are...Generalized Butterfly Architecture This section describes timings for pattern matching on the generalized Butterfly. The implementations on the Butterfly...these algorithms. Thus the best implementation of the techniques on the generalized Butterfly are the same as the implementation on the real Butterfly

  5. A DVB-RCS Multi-Channel, Multi-Frequency Demodulator Based on a Multi-Tasking Hardware-Software Architecture Using a System on Programmable Chip Technology

    Science.gov (United States)

    van Doninck, A.; Dendoncker, M.; Adriaensen, F.; Delbeke, P.; Rolle, A.; Craey, T.; Krekels, S.

    This paper highlights a multi-channel, multi-frequency DVB-RCS compatible burst demodulator implementation in a System On Programmable Chip (SOPC) technology. The core of the demodulator architecture is an SOPC device with an ARM processor embedded in the FPGA. The ARM processor performs the hard real-time signal processing functions and is supported by a COTS standard PC-based processor module running Linux/RT-Linux for the non-hard real-time demodulator functions. The implemented architecture differs completely from classic multi-channel solutions, in which the multi-channel functionality is realised by instantiating the entire demodulator multiple times. The paper also discusses the methodology followed for the SOPC design. Keywords: DVB-RCS, multi-channel, multi-frequency, SOPC, FPGA, ARM, RT-Linux

  6. Issue Mechanism for Embedded Simultaneous Multithreading Processor

    Science.gov (United States)

    Zang, Chengjie; Imai, Shigeki; Frank, Steven; Kimura, Shinji

    Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions per cycle close to, or even surpassing, that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate-register-based issue control is adopted and consecutive instructions with a parallel flag of ‘0’ are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an Instruction Queue skip mechanism that skips a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)), and thread one or thread n has the highest priority in turn (PR(T1-Tn)). The results show that the RR(T1-Tn) policy outperforms the others and that PR(T1-Tn) is almost the same as RR(T1-Tn) in terms of issued instructions per cycle.
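
    The round-robin policy with an empty-queue skip described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' issue logic: the function name `issue_round_robin`, the limit of one instruction per thread per cycle, and the rule that the starting thread advances by one each cycle are assumptions made for illustration, and the predicate and parallel-flag handling is not modeled.

```python
from collections import deque


def issue_round_robin(queues, issue_width, start):
    """Select up to issue_width instructions for one cycle.

    queues: one deque of pending instructions per thread.
    start:  index of the thread that gets first pick this cycle.
    Returns (issued, next_start), where issued is a list of
    (thread_index, instruction) pairs.
    Assumption: at most one instruction per thread per cycle.
    """
    n = len(queues)
    issued = []
    for offset in range(n):
        if len(issued) == issue_width:
            break
        t = (start + offset) % n
        if not queues[t]:
            # Skip mechanism: a thread with an empty queue gives up its slot.
            continue
        issued.append((t, queues[t].popleft()))
    # Assumption: rotate the starting thread by one for the next cycle.
    return issued, (start + 1) % n
```

    Calling the function once per simulated cycle and feeding `next_start` back in as `start` gives each thread first pick in turn, while threads with empty queues never consume an issue slot.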

  7. Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture

    Directory of Open Access Journals (Sweden)

    Yassin El-Hillali

    2009-01-01

    This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC) architecture optimized for Multiple Target Tracking (MTT) in automotive applications. An MTT system uses an automotive radar to track the speed and relative position of all the vehicles (targets) within its field of view. As the number of targets increases, the computational needs of the MTT system also increase, making it difficult for a single processor to handle it alone. Our implementation distributes the computational load among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how we designed and profiled the MTT application to partition it among different processors. It also explains how we applied different optimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the application's real-time constraints.

  8. Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture

    Directory of Open Access Journals (Sweden)

    Saghir MazenAR

    2009-01-01

    This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC) architecture optimized for Multiple Target Tracking (MTT) in automotive applications. An MTT system uses an automotive radar to track the speed and relative position of all the vehicles (targets) within its field of view. As the number of targets increases, the computational needs of the MTT system also increase, making it difficult for a single processor to handle it alone. Our implementation distributes the computational load among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how we designed and profiled the MTT application to partition it among different processors. It also explains how we applied different optimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the application's real-time constraints.

  9. Compute-unified device architecture implementation of a block-matching algorithm for multiple graphical processing unit cards

    Science.gov (United States)

    Massanes, Francesc; Cadennes, Marie; Brankov, Jovan G.

    2011-07-01

    We describe and evaluate a fast implementation of a classical block-matching motion estimation algorithm for multiple graphical processing units (GPUs) using the compute unified device architecture computing engine. The implemented block-matching algorithm uses the summed absolute difference error criterion and full grid search (FS) for finding the optimal block displacement. In this evaluation, we compared the execution time of a GPU and a CPU implementation for images of various sizes, using integer and noninteger search grids. The results show that use of a GPU card can shorten computation time by a factor of 200 for an integer and 1000 for a noninteger search grid. The additional speedup for a noninteger search grid comes from the fact that the GPU has built-in hardware for image interpolation. Further, when using multiple GPU cards, the presented evaluation shows the importance of the data splitting method across multiple cards, but an almost linear speedup with the number of cards is achievable. In addition, we compared the execution time of the proposed FS GPU implementation with two existing, highly optimized nonfull grid search CPU-based motion estimation methods, namely the implementation of the Pyramidal Lucas Kanade Optical flow algorithm in OpenCV and the simplified unsymmetrical multi-hexagon search in the H.264/AVC standard. In these comparisons, the FS GPU implementation still showed modest improvement even though the computational complexity of the FS GPU implementation is substantially higher than that of the non-FS CPU implementations. We also demonstrated that for an image sequence of 720 × 480 pixels in resolution, commonly used in video surveillance, the proposed GPU implementation is sufficiently fast for real-time motion estimation at 30 frames-per-second using two NVIDIA C1060 Tesla GPU cards.
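
    As a rough illustration of the criterion named in this abstract, the following Python sketch performs a full grid search over integer displacements using the summed absolute difference (SAD). It is a plain CPU reference written for this listing, not the authors' CUDA implementation; the function names, the list-of-rows image layout, and the tie-breaking rule (first minimum found wins) are assumptions.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Summed absolute difference between two size x size blocks.

    ref and cur are images stored as lists of rows; (rx, ry) and (cx, cy)
    are the top-left corners of the blocks in ref and cur respectively.
    """
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
    return total


def full_search(ref, cur, cx, cy, size, radius):
    """Exhaustively test every integer displacement within +/- radius.

    Returns ((dx, dy), cost) for the displacement with the lowest SAD,
    or None if no candidate block fits inside ref.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            # Ignore candidates that fall partly outside the reference image.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best
```

    Every candidate displacement is evaluated independently of the others, which is why this kind of search maps naturally onto many parallel GPU threads.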

  10. Does supporting multiple student strategies lead to greater learning and motivation? Investigating a source of complexity in the architecture of intelligent tutoring systems

    NARCIS (Netherlands)

    Waalkens, Maaike; Aleven, Vincent; Taatgen, Niels

    Intelligent tutoring systems (ITS) support students in learning a complex problem-solving skill. One feature that makes an ITS architecturally complex, and hard to build, is support for strategy freedom, that is, the ability to let students pursue multiple solution strategies within a given problem.

  11. Does Supporting Multiple Student Strategies Lead to Greater Learning and Motivation? Investigating a Source of Complexity in the Architecture of Intelligent Tutoring Systems

    Science.gov (United States)

    Waalkens, Maaike; Aleven, Vincent; Taatgen, Niels

    2013-01-01

    Intelligent tutoring systems (ITS) support students in learning a complex problem-solving skill. One feature that makes an ITS architecturally complex, and hard to build, is support for strategy freedom, that is, the ability to let students pursue multiple solution strategies within a given problem. But does greater freedom mean that students…

  12. Does supporting multiple student strategies lead to greater learning and motivation? Investigating a source of complexity in the architecture of intelligent tutoring systems

    NARCIS (Netherlands)

    Waalkens, Maaike; Aleven, Vincent; Taatgen, Niels

    2013-01-01

    Intelligent tutoring systems (ITS) support students in learning a complex problem-solving skill. One feature that makes an ITS architecturally complex, and hard to build, is support for strategy freedom, that is, the ability to let students pursue multiple solution strategies within a given problem.

  13. Using multiple hydrogen bonding cross-linkers to access reversibly responsive three dimensional graphene oxide architecture

    Science.gov (United States)

    Han, Junkai; Shen, Yongtao; Feng, Wei

    2016-07-01

    Three-dimensional (3D) graphene materials have attracted a lot of attention for efficiently utilizing inherent properties of graphene sheets. However, 3D graphene materials reported in the previous literature are constructed through covalent or weak non-covalent interactions, causing permanent structure/property changes. In this paper, a novel 3D graphene material of dynamic interactions between lamellas with 2-ureido-4[1H]-pyrimidinone as a supra-molecular motif has been synthesized. This 3D graphene material shows enhanced sheet interactions while the cross-linking takes place. With proper solvent stimulation, the integrated 3D graphene material can disassemble as isolated sheets. The driving force for the 3D structure assembly or disassembly is considered to be the forming or breaking of the multiple hydrogen bonding pairs. Furthermore, the 3D material is used as an intelligent dye adsorber to adsorb methylene blue and release it. The controllable and reversible characteristic of this 3D graphene material may open an avenue to the synthesis and application of novel intelligent materials.

  14. The architecture of gene regulatory variation across multiple human tissues: the MuTHER study.

    Directory of Open Access Journals (Sweden)

    Alexandra C Nica

    While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis--MCTA) permits immediate replication of eQTLs using co-twins (93%-98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%-20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits.

  15. The Architecture of Gene Regulatory Variation across Multiple Human Tissues: The MuTHER Study

    Science.gov (United States)

    Nica, Alexandra C.; Parts, Leopold; Glass, Daniel; Nisbet, James; Barrett, Amy; Sekowska, Magdalena; Travers, Mary; Potter, Simon; Grundberg, Elin; Small, Kerrin; Hedman, Åsa K.; Bataille, Veronique; Tzenova Bell, Jordana; Surdulescu, Gabriela; Dimas, Antigone S.; Ingle, Catherine; Nestle, Frank O.; di Meglio, Paola; Min, Josine L.; Wilk, Alicja; Hammond, Christopher J.; Hassanali, Neelam; Yang, Tsun-Po; Montgomery, Stephen B.; O'Rahilly, Steve; Lindgren, Cecilia M.; Zondervan, Krina T.; Soranzo, Nicole; Barroso, Inês; Durbin, Richard; Ahmadi, Kourosh; Deloukas, Panos; McCarthy, Mark I.; Dermitzakis, Emmanouil T.; Spector, Timothy D.

    2011-01-01

    While there have been studies exploring regulatory variation in one or more tissues, the complexity of tissue-specificity in multiple primary tissues is not yet well understood. We explore in depth the role of cis-regulatory variation in three human tissues: lymphoblastoid cell lines (LCL), skin, and fat. The samples (156 LCL, 160 skin, 166 fat) were derived simultaneously from a subset of well-phenotyped healthy female twins of the MuTHER resource. We discover an abundance of cis-eQTLs in each tissue similar to previous estimates (858 or 4.7% of genes). In addition, we apply factor analysis (FA) to remove effects of latent variables, thus more than doubling the number of our discoveries (1,822 eQTL genes). The unique study design (Matched Co-Twin Analysis—MCTA) permits immediate replication of eQTLs using co-twins (93%–98%) and validation of the considerable gain in eQTL discovery after FA correction. We highlight the challenges of comparing eQTLs between tissues. After verifying previous significance threshold-based estimates of tissue-specificity, we show their limitations given their dependency on statistical power. We propose that continuous estimates of the proportion of tissue-shared signals and direct comparison of the magnitude of effect on the fold change in expression are essential properties that jointly provide a biologically realistic view of tissue-specificity. Under this framework we demonstrate that 30% of eQTLs are shared among the three tissues studied, while another 29% appear exclusively tissue-specific. However, even among the shared eQTLs, a substantial proportion (10%–20%) have significant differences in the magnitude of fold change between genotypic classes across tissues. Our results underline the need to account for the complexity of eQTL tissue-specificity in an effort to assess consequences of such variants for complex traits. PMID:21304890

  16. Reconfigurable Architecture for Minimizing the Network Delays in the Multi-core Systems

    Directory of Open Access Journals (Sweden)

    D. Venkatavara Prasad

    2015-03-01

    NoC architecture performs better than bus-based architecture when the number of processors is large. On the other hand, bus-based architecture performs better than NoC when the number of processors is small. This leads to a new hybrid architecture, in which each node of a packet-switched mesh NoC contains a bus-based subsystem with a small number of processors. A few results have shown that this hybrid architecture performs better than either a purely NoC-based or a purely bus-based architecture. In the hybrid architecture, processors are connected to a bus, and the bus in turn is connected to a router. Each processor contains a private Level 1 (L1) cache. When a hybrid architecture is preferable, the optimal number of processors on each bus subsystem varies with the application. Hence the proposed architecture allows scalable bus-based multiprocessor subsystems on each node in the NoC. This system provides a multi-bus execution environment where each processor is connected to a bus and the bus-based subsystems communicate via routers connected in a mesh-style configuration. The system can be reconfigured to vary the number of bus subsystems and the number of processors on each subsystem. This architecture provides reliability and adaptability and reduces network delays. We implement the architecture and present its details, along with experimental results obtained using ns2 that indicate its advantages.

  17. Tailoring Software for Multiple Processor Systems

    Science.gov (United States)

    1982-10-01

    the DSN/SPICE Group at Carnegie-Mellon University. [137] W.E. Riddle, J.H. Sayler, A.R. Segal, A.M. Stavely, J.C. Wileden. A Description Scheme to Aid...IEEE Trans. Softw. Eng., 1982. To be published. [146] Gary H. Sockut and Robert P. Goldberg. Database Reorganization - Principles and Practice. Journal

  18. Stepping motor control processor reference manual. Volume I

    Energy Technology Data Exchange (ETDEWEB)

    Holloway, F.W.; VanArsdall, P.J.; Suski, G.J.; Gant, R.G.; Rash, M.

    1980-06-06

    This manual is intended to serve several purposes. The first goal is to describe the capabilities and operation of the SMC processor package from an operator or user point of view. Secondly, the manual will describe in some detail the basic hardware elements and how they can be used effectively to implement a step motor control system. Practical information on the use, installation and checkout of the hardware set is presented in the following sections along with programming suggestions. Available related system software is described in this manual for reference and as an aid in understanding the system architecture. Section two presents an overview and operations manual of the SMC processor describing its composition and functional capabilities. Section three contains hardware descriptions in some detail for the LLL-designed hardware used in the SMC processor. Basic theory of operation and important features are explained.

  19. Modal Processor Effects Inspired by Hammond Tonewheel Organs

    Directory of Open Access Journals (Sweden)

    Kurt James Werner

    2016-06-01

    In this design study, we introduce a novel class of digital audio effects that extend the recently introduced modal processor approach to artificial reverberation and effects processing. These pitch and distortion processing effects mimic the design and sonics of a classic additive-synthesis-based electromechanical musical instrument, the Hammond tonewheel organ. As a reverb effect, the modal processor simulates a room response as the sum of resonant filter responses. This architecture provides precise, interactive control over the frequency, damping, and complex amplitude of each mode. Into this framework, we introduce two types of processing effects: pitch effects inspired by the Hammond organ’s equal tempered “tonewheels”, “drawbar” tone controls, vibrato/chorus circuit, and distortion effects inspired by the pseudo-sinusoidal shape of its tonewheels and electromagnetic pickup distortion. The result is an effects processor that imprints the Hammond organ’s sonics onto any audio input.
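
    The idea of a response built as a sum of resonant filter responses can be sketched in a few lines of Python. This is a minimal illustration, not the authors' processor: each mode is modeled here as a complex one-pole filter, and the function name `modal_process`, the `(frequency, decay rate, complex amplitude)` tuple layout, and the choice of expressing damping as a decay rate in 1/s are all assumptions.

```python
import cmath
import math


def modal_process(x, modes, fs):
    """Filter the samples in x through a parallel bank of resonant modes.

    modes: list of (freq_hz, decay_per_second, complex_amplitude) tuples
           (hypothetical layout chosen for this sketch).
    fs:    sampling rate in Hz.
    Each mode is a complex one-pole filter
        y_m[n] = a_m * x[n] + p_m * y_m[n - 1],
    with pole p_m = exp((-decay_m + 2j*pi*freq_m) / fs).
    The output is the real part of the sum over all modes.
    """
    poles = [cmath.exp(complex(-d, 2.0 * math.pi * f) / fs) for f, d, _ in modes]
    amps = [a for _, _, a in modes]
    state = [0j] * len(modes)
    out = []
    for sample in x:
        acc = 0.0
        for m in range(len(modes)):
            state[m] = amps[m] * sample + poles[m] * state[m]
            acc += state[m].real
        out.append(acc)
    return out
```

    Because every mode keeps its own frequency, decay, and amplitude, each of those parameters can be changed independently, which is the kind of per-mode control the abstract refers to.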

  20. Fault Tolerance Mechanism in Chip Many-Core Processors

    Institute of Scientific and Technical Information of China (English)

    ZHANG Lei; HAN Yinhe; LI Huawei; LI Xiaowei

    2007-01-01

    As semiconductor technology advances, there will be billions of transistors on a single chip. Chip many-core processors are emerging to take advantage of these greater transistor densities to deliver greater performance. Effective fault tolerance techniques are essential to improve the yield of such complex chips. In this paper, a core-level redundancy scheme called N+M is proposed to improve N-core processors' yield by providing M spare cores. In such an architecture, topology is an important factor because it greatly affects the processors' performance. The concept of logical topology and a topology reconfiguration problem are introduced; reconfiguration transparently provides the target topology with the lowest performance degradation in the presence of faulty cores on-chip. A row rippling and column stealing (RRCS) algorithm is also proposed. Results show that RRCS can give solutions with an average degradation of 13.8% and negligible computing time.
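
    To see why M spare cores can raise yield, consider a simple binomial model, sketched below in Python. The model is an assumption made for illustration and is not taken from the paper: cores are taken to fail independently with the same per-core yield, a chip counts as good when at least N of its N+M cores work, and defects outside the cores as well as topology constraints are ignored. The function name `chip_yield` is hypothetical.

```python
from math import comb


def chip_yield(n, m, core_yield):
    """Probability that at least n of the n + m cores are fault-free.

    Assumes independent, identically distributed core failures with
    per-core yield core_yield (a simplifying assumption).
    """
    total = n + m
    return sum(
        comb(total, k) * core_yield ** k * (1.0 - core_yield) ** (total - k)
        for k in range(n, total + 1)
    )
```

    Under this model, comparing `chip_yield(n, 0, p)` with `chip_yield(n, m, p)` for a given per-core yield `p` shows how much of the loss the spares recover; it says nothing about the performance degradation that topology reconfiguration has to minimize.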

  1. Microprocessor architectures RISC, CISC and DSP

    CERN Document Server

    Heath, Steve

    1995-01-01

    'Why are there all these different processor architectures and what do they all mean? Which processor will I use? How should I choose it?' Given the task of selecting an architecture or design approach, both engineers and managers require a knowledge of the whole system and an explanation of the design tradeoffs and their effects. This is information that rarely appears in data sheets or user manuals. This book fills that knowledge gap. Section 1 provides a primer and history of the three basic microprocessor architectures. Section 2 describes the ways in which the architectures react with the

  2. Specification for a reconfigurable optoelectronic VLSI processor suitable for digital signal processing.

    Science.gov (United States)

    Fey, D; Kasche, B; Burkert, C; Tschäche, O

    1998-01-10

    A concept for a parallel digital signal processor based on optical interconnections and optoelectronic VLSI circuits is presented. It is shown that the proper combination of optical communication, architecture, and algorithms allows a throughput that outperforms purely electronic solutions. The usefulness of low-level algorithms from the add-and-shift class is emphasized. These algorithms lead to fine-grain, massively parallel on-chip processor architectures with high demands for optical off-chip interconnections. A comparative performance analysis shows the superiority of a bit-serial architecture. This architecture is mapped onto an optoelectronic three-dimensional circuit, and the necessary optical interconnection scheme is specified.
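
    The abstract does not say which add-and-shift algorithms are used. As a generic representative of that class, the Python sketch below multiplies two unsigned integers by scanning one operand a bit at a time and adding a shifted copy of the other. The function name and the least-significant-bit-first order are assumptions.

```python
def shift_add_multiply(a, b, bits):
    """Multiply unsigned integers a and b using only shifts and adds.

    Scans the lowest `bits` bits of b, least significant bit first,
    and accumulates a shifted copy of a for every set bit.
    """
    acc = 0
    for i in range(bits):
        if (b >> i) & 1:
            acc += a << i
    return acc
```

    Each loop iteration needs only a one-bit test, a shift, and an add, which is the kind of simple, regular step that suits fine-grain and bit-serial hardware.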

  3. Baseband processor development for the Advanced Communications Satellite Program

    Science.gov (United States)

    Moat, D.; Sabourin, D.; Stilwell, J.; Mccallister, R.; Borota, M.

    1982-01-01

    An onboard-baseband-processor concept for a satellite-switched time-division-multiple-access (SS-TDMA) communication system was developed for NASA Lewis Research Center. The baseband processor routes and controls traffic on an individual message basis while providing significant advantages in improved link margins and system flexibility. Key technology developments required to prove the flight readiness of the baseband-processor design are being verified in a baseband-processor proof-of-concept model. These technology developments include serial MSK modems, a Clos-type baseband routing switch, a single-chip CMOS maximum-likelihood convolutional decoder, and custom LSI implementation of high-speed, low-power ECL building blocks.

  4. Compiling quantum algorithms for architectures with multi-qubit gates

    Science.gov (United States)

    Martinez, Esteban A.; Monz, Thomas; Nigg, Daniel; Schindler, Philipp; Blatt, Rainer

    2016-06-01

    In recent years, small-scale quantum information processors have been realized in multiple physical architectures. These systems provide a universal set of gates that allow one to implement any given unitary operation. The decomposition of a particular algorithm into a sequence of these available gates is not unique. Thus, the fidelity of the implementation of an algorithm can be increased by choosing an optimized decomposition into available gates. Here, we present a method to find such a decomposition, where a small-scale ion trap quantum information processor is used as an example. We demonstrate a numerical optimization protocol that minimizes the number of required multi-qubit entangling gates by design. Furthermore, we adapt the method for state preparation, and quantum algorithms including in-sequence measurements.

  5. High-level language computer architecture

    CERN Document Server

    Chu, Yaohan

    1975-01-01

    High-Level Language Computer Architecture offers a tutorial on high-level language computer architecture, including von Neumann architecture and syntax-oriented architecture as well as direct and indirect execution architecture. Design concepts of Japanese-language data processing systems are discussed, along with the architecture of stack machines and the SYMBOL computer system. The conceptual design of a direct high-level language processor is also described. Comprised of seven chapters, this book first presents a classification of high-level language computer architecture according to the pr

  6. Performance Analysis of Embedded Processor

    Institute of Scientific and Technical Information of China (English)

    盛永杰; 刘博勤; 陈章龙

    2003-01-01

    This paper first introduces the architecture of current embedded processors. It then focuses on some important factors that affect embedded processor performance and proposes some performance analysis methods. Finally, a generic performance analysis system is illustrated with a short example to clarify these ideas.

  7. Scientific programming on massively parallel processor CP-PACS

    Energy Technology Data Exchange (ETDEWEB)

    Boku, Taisuke [Tsukuba Univ., Ibaraki (Japan). Inst. of Information Sciences and Electronics

    1998-03-01

    The massively parallel processor CP-PACS targets a range of problems in computational physics, and its architecture was designed to support diverse numerical processing. This report outlines the CP-PACS, gives a programming example using the Kernel CG benchmark from the NAS Parallel Benchmarks (NPB), version 1, and describes two major features of the CP-PACS architecture: the pseudo vector processing mechanism, and the tuning of parallel scientific and technical computation on the three-dimensional hyper crossbar network. The CP-PACS uses processing units (PUs) based on a RISC processor augmented with a pseudo vector processor. Pseudo vector processing is realized as loop processing with scalar instructions. The features of the network connecting the PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The most time-consuming part of the main loop is the matrix-vector product (matvec), and the parallel processing of the matvec is explained. The CPU computation time is determined. For the performance evaluation, the execution time, the short-vector processing of the pseudo vector processor based on the slide window, and a comparison with other parallel computers are reported. (K.I.)

  8. The ATLAS Level-1 Calorimeter Trigger Architecture

    CERN Document Server

    Garvey, J; Mahout, G; Moye, T H; Staley, R J; Watkins, P M; Watson, A T; Achenbach, R; Hanke, P; Kluge, E E; Meier, K; Meshkov, P; Nix, O; Penno, K; Schmitt, K; Ay, Cc; Bauss, B; Dahlhoff, A; Jakobs, K; Mahboubi, K; Schäfer, U; Trefzger, T M; Eisenhandler, E F; Landon, M; Moyse, E; Thomas, J; Apostoglou, P; Barnett, B M; Brawn, I P; Davis, A O; Edwards, J; Gee, C N P; Gillman, A R; Perera, V J O; Qian, W; Bohm, C; Hellman, S; Hidvégi, A; Silverstein, S; RT 2003 13th IEEE-NPSS Real Time Conference

    2004-01-01

    The architecture of the ATLAS Level-1 Calorimeter Trigger system (L1Calo) is presented. Common approaches have been adopted for data distribution, result merging, readout, and slow control across the three different subsystems. A significant amount of common hardware is utilized, yielding substantial savings in cost, spares, and development effort. A custom, high-density backplane has been developed with data paths suitable for both the em/tt cluster processor (CP) and jet/energy-summation processor (JEP) subsystems. Common modules also provide interfaces to VME, CANbus and the LHC Timing, Trigger and Control system (TTC). A common data merger module (CMM) uses FPGAs with multiple configurations for summing electron/photon and tau/hadron cluster multiplicities, jet multiplicities, or total and missing transverse energy. The CMM performs both crate- and system-level merging. A common, FPGA-based readout driver (ROD) is used by all of the subsystems to send input, intermediate and output data to the data acquis...

  9. Fault tolerant, radiation hard, high performance digital signal processor

    Science.gov (United States)

    Holmann, Edgar; Linscott, Ivan R.; Maurer, Michael J.; Tyler, G. L.; Libby, Vibeke

    1990-01-01

    An architecture has been developed for a high-performance VLSI digital signal processor that is highly reliable, fault-tolerant, and radiation-hard. The signal processor, part of a spacecraft receiver designed to support uplink radio science experiments at the outer planets, organizes the connections between redundant arithmetic resources, register files, and memory through a shuffle exchange communication network. The configuration of the network and the state of the processor resources are all under microprogram control, which both maps the resources according to algorithmic needs and reconfigures the processing should a failure occur. In addition, the microprogram is reloadable through the uplink to accommodate changes in the science objectives throughout the course of the mission. The processor will be implemented with silicon compiler tools, and its design will be verified through silicon compilation simulation at all levels from the resources to full functionality. By blending reconfiguration with redundancy the processor implementation is fault-tolerant and reliable, and possesses the long expected lifetime needed for a spacecraft mission to the outer planets.

  10. Dense and Sparse Matrix Operations on the Cell Processor

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel W.; Shalf, John; Oliker, Leonid; Husbands, Parry; Yelick, Katherine

    2005-05-01

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.

  11. Applications of array processors in the analysis of remote sensing images

    Science.gov (United States)

    Ramapriyan, H. K.; Strong, J. P.

    1984-01-01

    The architectures, programming characteristics, and ranges of application of past, present, and planned array processors for the digital processing of remote-sensing images are compared. Such functions as radiometric and geometric corrections, principal-components analysis, cluster coding, histogram generation, grey-level mapping, convolution, classification, and mensuration and modeling operations are considered, and both pipeline-type and single-instruction/multiple-data-stream (SIMD) arrays are evaluated. Numerical results are presented in a table, and it is found that the pipeline-type arrays normally used with minicomputers increase their speed significantly at low cost, while even further gains are provided by the more expensive SIMD arrays. Most image-processing operations become I/O-limited when SIMD arrays are used with current I/O devices.

  12. LAYERED ARCHITECTURE FOR ASSEMBLING BUSINESS APPLICATIONS FROM DISTRIBUTED COMPONENTS

    Institute of Scientific and Technical Information of China (English)

    Hemant JAIN; Balarama REDDY

    2004-01-01

    Modern business applications are generally characterized as distributed across many processors and/or sites, accessing data from multiple sources, and having web-based interfaces. These applications may also involve systems or processes from multiple companies or vendor-provided services. The changing business environment and technologies require that the application be agile and adaptable in a short period. Component-based development has recently attracted increased attention as a preferred technology for developing business applications. However, the tools and techniques for the design, implementation, management and deployment of applications based on these technologies are at a very early stage of development. This paper presents an overview of a distributed architecture for the deployment of applications based on business components. The application of the architecture in an auto insurance claim domain is briefly described. A number of open research issues have been identified.

  13. Fast Forwarding with Network Processors

    OpenAIRE

    Lefèvre, Laurent; Lemoine, E.; Pham, C; Tourancheau, B.

    2003-01-01

    Forwarding is a mechanism found in many network operations. Although a regular workstation is able to perform forwarding operations it still suffers from poor performances when compared to dedicated hardware machines. In this paper we study the possibility of using Network Processors (NPs) to improve the capability of regular workstations to forward data. We present a simple model and an experimental study demonstrating that even though NPs are less powerful than Host Processors (HPs) they ca...

  14. Floating-point array processors evolve into tailorable systems

    Energy Technology Data Exchange (ETDEWEB)

    Kotelly, G.

    1983-05-12

    Recently introduced 32-bit floating-point array processors (APs) combine configuration flexibility, integrated hardware/software system architecture and real-time computational power to meet a variety of application requirements. APs have now evolved into general-purpose boxes and PC boards which readily adapt to changing OEM needs. Contributing to this greater AP versatility are a variety of hardware and software features. These features are described and the range of available products is surveyed.

  15. High speed optical object recognition processor with massive holographic memory

    Science.gov (United States)

    Chao, T.; Zhou, H.; Reyes, G.

    2002-01-01

    Real-time object recognition using a compact grayscale optical correlator will be introduced. A holographic memory module for storing a large bank of optimum correlation filters, to accommodate the large data throughput rate needed for many real-world applications, has also been developed. System architecture of the optical processor and the holographic memory will be presented. Application examples of this object recognition technology will also be demonstrated.

  16. Implementation Of A Prototype Digital Optical Cellular Image Processor (DOCIP)

    Science.gov (United States)

    Huang, K. S.; Sawchuk, A. A.; Jenkins, B. K.; Chavel, P.; Wang, J. M.; Weber, A. G.; Wang, C. H.; Glaser, I.

    1989-02-01

    A processing element of a prototype digital optical cellular image processor (DOCIP) is implemented to demonstrate a particular parallel computing and interconnection architecture. This experimental digital optical computing system consists of a 2-D array of 54 optical logic gates, a 2-D array of 53 subholograms to provide interconnections between gates, and electronic input/output interfaces. The multi-facet interconnection hologram used in this system is fabricated by a computer-controlled optical system to offer very flexible interconnections.

  17. Bilinear Interpolation Image Scaling Processor for VLSI

    Directory of Open Access Journals (Sweden)

    Ms. Pawar Ashwini Dilip

    2014-05-01

    We introduce an image scaling processor using a VLSI technique. It consists of bilinear interpolation, a clamp filter and a sharpening spatial filter. The bilinear interpolation algorithm is popular due to its computational efficiency and image quality, but the resulting image contains blurred edges and aliasing artifacts after scaling. To reduce the blurring and aliasing artifacts, the sharpening spatial filter and clamp filter are used as pre-filters. These filters are realized using T-model and inversed T-model convolution kernels. To reduce the memory buffer and computing resources of the proposed image processor design, two T-model or inversed T-model filters are combined into a combined filter which requires only one line buffer memory. Also, to reduce hardware cost, a reconfigurable calculation unit (RCU) is invented. The VLSI architecture in this work can achieve 280 MHz with 6.08-K gate counts, and its core area is 30 378 μm² synthesized by a 0.13-μm CMOS process.
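
    For readers unfamiliar with the bilinear interpolation step named in this abstract, a minimal software sketch follows. It assumes a grayscale image stored as a list of rows and a corner-aligned coordinate mapping; both are illustrative choices, and the clamp and sharpening pre-filters of the paper are not modeled.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D grayscale image (list of rows) by bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional source row (corner-aligned).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate horizontally on the two neighbouring rows,
            # then vertically between those two results.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

    Scaling the 2x2 image `[[0, 10], [20, 30]]` to 3x3 keeps the four corner values and places their average, 15, at the centre.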

  18. Scaling the ion trap quantum processor.

    Science.gov (United States)

    Monroe, C; Kim, J

    2013-03-08

    Trapped atomic ions are standards for quantum information processing, serving as quantum memories, hosts of quantum gates in quantum computers and simulators, and nodes of quantum communication networks. Quantum bits based on trapped ions enjoy a rare combination of attributes: They have exquisite coherence properties, they can be prepared and measured with nearly 100% efficiency, and they are readily entangled with each other through the Coulomb interaction or remote photonic interconnects. The outstanding challenge is the scaling of trapped ions to hundreds or thousands of qubits and beyond, at which scale quantum processors can outperform their classical counterparts in certain applications. We review the latest progress and prospects in that effort, with the promise of advanced architectures and new technologies, such as microfabricated ion traps and integrated photonics.

  19. A microprogrammable versatile processor (MVP) for digital signal and data processing

    Science.gov (United States)

    Tsou, H. E.; Nix, W. C.; Weir, E. M.; Smith, J. J.; Kushner, M. L.

    A simple, functionally partitioned, data bus-oriented, horizontally-microprogrammed processor architecture has been developed which is ideal for low cost signal processing, emulation, and data processing applications. Due to its functional modularity and bus oriented architecture, new technologies can be inserted without perturbing the design and the support software. The processor has separate RALUs for data processing and address generation and is the first signal processor to use the TRW 16 x 16-bit multiplier and the AMD RALUs. It allows for growth by adding attached arithmetic processors, microprocessors, etc. Support software includes register transfer-level HOL, assembler, loader, linker, diagnostics, library, executive and I/O drivers. Applications include speech processing, coding/decoding, modulation/demodulation, signal processing and emulation.

  20. Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment

    Science.gov (United States)

    Duong, Tuan A.; Duong, Vu A.

    2012-01-01

    A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.

  1. Simple and cost-effective fabrication of size-tunable zinc oxide architectures by multiple size reduction technique

    Directory of Open Access Journals (Sweden)

    Hyeong-Ho Park, Xin Zhang, Seon-Yong Hwang, Sang Hyun Jung, Semin Kang, Hyun-Beom Shin, Ho Kwan Kang, Hyung-Ho Park, Ross H Hill and Chul Ki Ko

    2012-01-01

    We present a simple size reduction technique for fabricating 400 nm zinc oxide (ZnO) architectures using a silicon master containing only microscale architectures. In this approach, the overall fabrication, from the master to the molds and the final ZnO architectures, features cost-effective UV photolithography, instead of electron beam lithography or deep-UV photolithography. A photosensitive Zn-containing sol–gel precursor was used to imprint architectures by direct UV-assisted nanoimprint lithography (UV-NIL). The resulting Zn-containing architectures were then converted to ZnO architectures with reduced feature sizes by thermal annealing at 400 °C for 1 h. The imprinted and annealed ZnO architectures were also used as new masters for the size reduction technique. ZnO pillars of 400 nm diameter were obtained from a silicon master with pillars of 1000 nm diameter by simply repeating the size reduction technique. The photosensitivity and contrast of the Zn-containing precursor were measured as 6.5 J cm⁻² and 16.5, respectively. Interesting complex ZnO patterns, with both microscale pillars and nanoscale holes, were demonstrated by the combination of dose-controlled UV exposure and a two-step UV-NIL.

  2. Neural network surface acoustic wave RF signal processor for digital modulation recognition.

    Science.gov (United States)

    Kavalov, Dimitar; Kalinin, Victor

    2002-09-01

    An architecture of a surface acoustic wave (SAW) processor based on an artificial neural network is proposed for an automatic recognition of different types of digital passband modulation. Three feed-forward networks are trained to recognize filtered and unfiltered binary phase shift keying (BPSK) and quadrature phase shift keying (QPSK) signals, as well as unfiltered BPSK, QPSK, and 16 quadrature amplitude (16QAM) signals. Performance of the processor in the presence of additive white Gaussian noise (AWGN) is simulated. The influence of second-order effects in SAW devices, phase, and amplitude errors on the performance of the processor also is studied.

  3. Java Processor Optimized for RTSJ

    Directory of Open Access Journals (Sweden)

    Tu Shiliang

    2007-01-01

    Due to the preeminent work of the real-time specification for Java (RTSJ), Java is increasingly expected to become the leading programming language in real-time systems. To provide a Java platform suitable for real-time applications, a Java processor which can execute Java bytecode directly is proposed in this paper. It provides efficient support in hardware for some mechanisms specified in the RTSJ and offers a simpler programming model through ameliorating the scoped memory of the RTSJ. The worst case execution time (WCET) of the bytecodes implemented in this processor is predictable by employing the optimization method proposed in our previous work, in which all the processing that interferes with predictability is handled before bytecode execution. A further advantage of this method is that it makes the implementation of the processor simpler and suited to a low-cost FPGA chip.

  4. Effective Utilization of Multicore Processor for Unified Threat Management Functions

    Directory of Open Access Journals (Sweden)

    Radhakrishnan Shanmugasundaram

    2012-01-01

    Problem statement: Multicore and multithreaded CPUs have become the new approach for increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Unified threat management (UTM) is one such application that has multiple functions to be implemented at high speeds. Increasing the performance of the system by knowing the nature of the functionality and effectively utilizing multiple processors for each of the functions warrants detailed experimentation. In this study, some of the functions of UTM are implemented using multiple processors for each of the functions. Approach: This evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. OpenMP parallelization methods are used for scheduling the logical CPUs for the parallelized application. Results: Execution time for some of the implemented UTM functions was analyzed to arrive at an effective allocation and parallelization methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, the type of parallelization method for each of the implemented UTM functions is suggested.

  5. Fast 2D-DCT implementations for VLIW processors

    OpenAIRE

    Sohm, OP; Canagarajah, CN; Bull, DR

    1999-01-01

    This paper analyzes various fast 2D-DCT algorithms regarding their suitability for VLIW processors. Operations for truncation or rounding which are usually neglected in proposals for fast algorithms have also been taken into consideration. Loeffler's algorithm with parallel multiplications was found to be most suitable due to its parallel structure

  6. List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

    Science.gov (United States)

    Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

    2014-03-01

    List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple-data threads per core with each thread having a 512-bit single-instruction, multiple-data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating list-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi co-processor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for multi-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging.

  7. VLSI Design of a Variable-Length FFT/IFFT Processor for OFDM-Based Communication Systems

    Directory of Open Access Journals (Sweden)

    Jen-Chih Kuo

    2003-12-01

    The technique of orthogonal frequency division multiplexing (OFDM) is famous for its robustness against frequency-selective fading channels. This technique has been widely used in many wired and wireless communication systems. In general, the fast Fourier transform (FFT) and inverse FFT (IFFT) operations are used as the modulation/demodulation kernel in OFDM systems, and the sizes of the FFT/IFFT operations vary across different applications of OFDM systems. In this paper, we design and implement a variable-length prototype FFT/IFFT processor to cover different specifications of OFDM applications. The cached-memory FFT architecture is our suggested VLSI system architecture for the prototype FFT/IFFT processor, chosen for low power consumption. We also implement the twiddle factor butterfly processing element (PE) based on the coordinate rotation digital computer (CORDIC) algorithm, which avoids the use of a conventional multiplication-and-accumulation unit and evaluates the trigonometric functions using only add-and-shift operations. Finally, we implement a variable-length prototype FFT/IFFT processor with TSMC 0.35 μm 1P4M CMOS technology. The simulation results show that the chip can perform (64-2048)-point FFT/IFFT operations at up to 80 MHz operating frequency, which can meet the speed requirement of most OFDM standards such as WLAN, ADSL, VDSL (256∼2K), DAB, and 2K-mode DVB.
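
    The add-and-shift evaluation of trigonometric functions that CORDIC provides can be sketched in software as follows. This is a generic rotation-mode CORDIC in floating point, assuming an input angle within about ±π/2; the function name and iteration count are illustrative, and it does not reproduce the paper's fixed-point processing element.

```python
import math


def cordic_cos_sin(angle, iterations=24):
    """Return (cos(angle), sin(angle)) via rotation-mode CORDIC, |angle| <= pi/2."""
    # Elementary rotation angles atan(2^-i); a hardware design stores these in a small ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # The micro-rotations stretch the vector; pre-scale by the inverse gain.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        # Multiplying by 2^-i is a right shift in fixed-point hardware,
        # so each step needs only shifts and additions.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return x, y
```

    A twiddle factor for an N-point FFT could then be approximated by calling `cordic_cos_sin` with the corresponding rotation angle, after folding the angle into the convergence range.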

  8. Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Shchur, L. N.; Andreichenko, V. B.; Dotsenko, Vl. S.

    We describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.
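
    As background for the Wolff algorithm that this processor realizes in hardware, a single cluster update can be sketched in software. The sketch assumes a pure 2D Ising model with coupling J = 1 and periodic boundaries; the function and variable names are illustrative and say nothing about the SPP's internal design.

```python
import math
import random


def wolff_step(spins, L, beta, rng):
    """Flip one Wolff cluster on an L x L periodic Ising lattice; return its size."""
    p_add = 1.0 - math.exp(-2.0 * beta)  # bond activation probability for J = 1
    i0, j0 = rng.randrange(L), rng.randrange(L)
    s = spins[i0][j0]
    spins[i0][j0] = -s  # flipping on visit also marks the site as in the cluster
    stack = [(i0, j0)]
    size = 1
    while stack:
        i, j = stack.pop()
        neighbours = (((i + 1) % L, j), ((i - 1) % L, j),
                      (i, (j + 1) % L), (i, (j - 1) % L))
        for ni, nj in neighbours:
            # Only aligned, not-yet-flipped neighbours can join the cluster.
            if spins[ni][nj] == s and rng.random() < p_add:
                spins[ni][nj] = -s
                stack.append((ni, nj))
                size += 1
    return size
```

    Typical use is to build `spins = [[1] * L for _ in range(L)]`, create `rng = random.Random(seed)`, and call `wolff_step` repeatedly while accumulating observables such as the magnetization.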

  9. Cluster algorithm special purpose processor

    Energy Technology Data Exchange (ETDEWEB)

    Talapov, A.L.; Shchur, L.N.; Andreichenko, V.B.; Dotsenko, V.S. (Landau Inst. for Theoretical Physics, GSP-1 117940 Moscow V-334 (USSR))

    1992-08-10

    In this paper, the authors describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.

  10. Associative Memory Design for the FastTrack Processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Volpi, G; Beccherle, R; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Amerio, S; Hoff, J; Liu, T; Sacco, I; Liberali, V; Stabile, A; Schoening, A; Soltveit, H; Tripiccione, R

    2011-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the fast tracker processor proposed to upgrade the ATLAS trigger at LHC. Finally we will focus on possible future applications inside and outside high-energy physics.

  11. General Architecture and Instruction Set Enhancements for Multimedia Applications

    Directory of Open Access Journals (Sweden)

    Mansour Assaf

    2007-12-01

    Present-day multimedia applications (MMAs) are driving the computing industry, as every application being developed uses multimedia in one way or another. Computer architects are building computer systems with powerful processors to handle the MMAs. There have been tremendous changes in the design of the processors to handle different types of MMAs. We see a lot of such application-specific processors today in the industry; different architectures have been proposed for processing MMAs, such as VLIW, superscalar (general-purpose processor enhanced with a multimedia extension such as MMX), vector architecture, SIMD architectures, and reconfigurable computing devices. Many of the General Purpose Processors (GPPs) require coprocessors to handle graphics and sound, and usually those processors are either expensive or incompatible. Keeping these and the demands of MMAs in mind, designers have made changes to GPPs; many GPP vendors have added instructions to their Instruction Set Architecture (ISA). All these processors use similar techniques to execute multimedia instructions. This survey paper investigates the enhancements made to the GPPs in their general architecture as well as the ISA. We present the many different techniques used by GPP designers to handle MMAs, describe the present-day available GPP architectures, compare the different techniques, and conclude the survey.

  12. Wavelength-encoded OCDMA system using opto-VLSI processors.

    Science.gov (United States)

    Aljada, Muhsen; Alameh, Kamal

    2007-07-01

    We propose and experimentally demonstrate a 2.5 Gbits/s per user wavelength-encoded optical code-division multiple-access encoder-decoder structure based on opto-VLSI processing. Each encoder and decoder is constructed using a single 1D opto-very-large-scale-integrated (VLSI) processor in conjunction with a fiber Bragg grating (FBG) array of different Bragg wavelengths. The FBG array spectrally and temporally slices the broadband input pulse into several components and the opto-VLSI processor generates codewords using digital phase holograms. System performance is measured in terms of the autocorrelation and cross-correlation functions as well as the eye diagram.

  13. A processor sharing model for wireless data communication

    DEFF Research Database (Denmark)

    Hansen, Martin Bøgsted

    occupies these servers for an exponentially distributed holding time with mean $1/\mu$. However, in lack of requested resources some Time Division Multiple Access (TDMA) implementations for mobile data communication like High Speed Circuit Switched Data (HSCSD) and General Packet Radio Service (GPRS......) allow already established resources for data connections to be downgraded to allow a new connection to be established. As noted by Litjens and Boucherie (2002) this resembles classical processor sharing models, and in this spirit we formulate a variant of the processor sharing model with a limited...

  14. The hardware track finder processor in CMS at CERN

    CERN Document Server

    Kluge, A

    1997-01-01

    The work covers the design of the Track Finder Processor in the high energy experiment CMS (Compact Muon Solenoid, planned for 2005) at CERN/Geneva. The task of this processor is to identify muons and measure their transverse momentum. The track finder processor makes it possible to determine the physical relevance of each high energetic collision and to forward only interesting data to the data analysis units. Data of more than two hundred thousand detector cells are used to determine the location of muons and measure their transverse momentum. Every 25 ns a new data set is generated. Measurement of location and transverse momentum of the muons can be terminated within 350 ns by using an ASIC (Application Specific Integrated Circuit). A pipeline architecture processes new data sets with the required data rate of 40 MHz to ensure dead time free operation. In the framework of this study specifications and the overall concept of the track finder processor were worked out in detail. Simulations were performed...
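
    The timing figures quoted in this abstract are related by simple arithmetic, sketched below. The function name is hypothetical, and the "data sets in flight" figure is an inference from the quoted 25 ns period and 350 ns latency, not a number stated in the record.

```python
def pipeline_figures(period_ns, latency_ns):
    """Return (input rate in MHz, data sets in flight) for a fully pipelined unit."""
    rate_mhz = 1000.0 / period_ns
    # A new data set enters every period, so a result that takes latency_ns
    # overlaps with ceil(latency / period) data sets inside the pipeline.
    in_flight = -(-latency_ns // period_ns)  # ceiling division on integers
    return rate_mhz, in_flight


rate, depth = pipeline_figures(25, 350)
print(rate, depth)  # 40.0 14
```

    A 25 ns period corresponds to the 40 MHz rate in the abstract, and a 350 ns latency implies about 14 data sets being processed concurrently, which is why a pipeline is needed for dead-time-free operation.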

  15. A CASE FOR HYBRID INSTRUCTION ENCODING FOR REDUCING CODE SIZE IN EMBEDDED SYSTEM-ON-CHIPS BASED ON RISC PROCESSOR CORES

    Directory of Open Access Journals (Sweden)

    Govindarajalu Bakthavatsalam

    2014-01-01

    Embedded computing differs from general purpose computing in several aspects. In most embedded systems, size, cost and power consumption are more important than performance. In embedded System-on-Chips (SoC), memory is a scarce resource and it poses constraints on chip space, cost and power consumption. Whereas the fixed instruction length feature of RISC architecture simplifies instruction decoding and pipeline implementation, its undesirable side effect is a code size increase caused by a large number of unused bits. Code size reduction minimizes memory size, chip space and power consumption, all of which are significant for low power portable embedded systems. Though code size reduction has drawn the attention of architects and developers, the solutions currently used are more of cure than of prevention. Considering the huge number of embedded applications, there is a need for a dedicated processor optimized for low power and portable embedded systems. In this study, we propose a variation of Hybrid Instruction Encoding (HIE) for embedded processors. Our scheme uses a fixed number of multiple instruction lengths with provision for hybrid sizes for the offset and the immediate fields, thereby reducing the number of unused bits. We simulated the HIE for MIPS32 processors and measured code sizes of various embedded applications from the MiBench and MediaBench benchmarks using a newly developed offline tool. We noticed up to 27% code reduction for large and medium sized embedded applications respectively. This results in a reduction of on-chip memory capacity of up to 1 megabyte, which is very significant for SoC based embedded applications. Considering the large market share of embedded systems, it is worth investing in a new architecture and development of dedicated HIE-RISC processor cores for portable embedded systems based on SoCs.

  16. Performance-Optimum Superscalar Architecture for Embedded Applications

    CERN Document Server

    Alipour, Mehdi

    2012-01-01

    Embedded applications are widely used in portable devices such as wireless phones, personal digital assistants, laptops, etc. High throughput and real time requirements are especially important in such data-intensive tasks. Therefore, architectures that provide the required performance are the most desirable. On the other hand, processor performance is strongly related to the average memory access delay, the number of processor registers, the size of the instruction window and the superscalar parameters. Therefore, cache, register file and superscalar parameters are the major architectural concerns in designing a superscalar architecture for embedded processors. Although increasing cache and register file size leads to performance improvements in high performance embedded processors, the increased area, power consumption and memory delay are the overheads of these techniques. This paper explores the effect of cache, register file and superscalar parameters on the processor performance to specify the optimum size ...

  17. Novel wavelength division multiplex-radio over fiber-passive optical network architecture for multiple access points based on multitone generation and triple sextupling frequency

    Science.gov (United States)

    Cheng, Guangming; Guo, Banghong; Liu, Songhao; Huang, Xuguang

    2014-01-01

    An innovative wavelength division multiplex-radio over fiber-passive optical network architecture for multiple access points (AP) based on multitone generation and triple sextupling frequency is proposed and demonstrated. A dual-drive Mach-Zehnder modulator (DD-MZM) is utilized to realize the multitone generation. Even sidebands are suppressed to make the adjacent frequency separation twice the frequency of the local oscillator by adjusting the modulation voltage of the DD-MZM. By adopting three fiber Bragg gratings to reflect the unmodulated sidebands for uplink communications, source-free operation at the optical network unit (ONU) is achieved. The system can support at least three APs at one ONU simultaneously with a 30 km single-mode fiber (SMF) transmission and 5 Gb/s data rate for both uplink and downlink communications. The theoretical analysis and simulation results show the architecture has an excellent performance and will be a promising candidate in future hybrid access networks.

  18. Fast, Massively Parallel Data Processors

    Science.gov (United States)

    Heaton, Robert A.; Blevins, Donald W.; Davis, ED

    1994-01-01

    Proposed fast, massively parallel data processor contains 8x16 array of processing elements with efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on "X" interconnection grid with external memory via high-capacity input/output bus. This approach to conditional operation nearly doubles speed of various arithmetic operations.

  19. ASSP Advanced Sensor Signal Processor.

    Science.gov (United States)

    1984-06-01

    transfer data and commands. When a processor receives the required data (image) and/or command, that data will be operated on autonomously. The...BAN is provided by two separately controlled DMA address generator chips (Am2940). Each of these DMA chips creates an 8-bit address. One DMA chip gene

  20. The DPGA for Combining the Superscalar and Multithreaded Processors Principal

    Institute of Scientific and Technical Information of China (English)

    2001-01-01

    The performance of scalable shared-memory multiprocessors suffers from three types of latency: memory latency, the latency caused by inter-process synchronization, and the latency caused by instructions that take multiple cycles to produce results. To tolerate these three types of latency, the following techniques were proposed to be coupled: coarse-grained multithreading, the superscalar processor and a reconfigurable device. Coarse-grained multithreading overlaps long-latency operations of one thread of computation with the execution of other threads. The superscalar processor principle is used to tolerate instruction latency by issuing several instructions simultaneously. The DPGA is coupled with this processor in order to reduce the context-switching overhead.

  1. Processor Allocation for Optimistic Parallelization of Irregular Programs

    CERN Document Server

    Versaci, Francesco

    2012-01-01

    Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this pro...

  2. Cassava processors' awareness of occupational and environmental ...

    African Journals Online (AJOL)

    Cassava processors' awareness of occupational and environmental hazards ... Majority of the respondents also complained of lack of water (78.4%), lack of ... so as to reduce the problems faced by cassava processors during processing.

  3. Architecture on Architecture

    DEFF Research Database (Denmark)

    Olesen, Karen

    2016-01-01

    This paper will discuss the challenges faced by architectural education today. It takes as its starting point the double commitment of any school of architecture: on the one hand the task of preserving the particular knowledge that belongs to the discipline of architecture, and on the other hand...... that is not scientific or academic but is more like a latent body of data that we find embedded in existing works of architecture. This information, it is argued, is not limited by the historical context of the work. It can be thought of as a virtual capacity – a reservoir of spatial configurations that can...... the autonomy of architecture, not as an esoteric concept but as a valid source of information in a pragmatic design practice, may help us overcome the often-proclaimed dichotomy between formal autonomy and a societally committed architecture. It follows that in architectural education there can be a close...

  4. Digital design and computer architecture

    CERN Document Server

    Harris, David

    2010-01-01

    Digital Design and Computer Architecture is designed for courses that combine digital logic design with computer organization/architecture or that teach these subjects as a two-course sequence. Digital Design and Computer Architecture begins with a modern approach by rigorously covering the fundamentals of digital logic design and then introducing Hardware Description Languages (HDLs). Featuring examples of the two most widely-used HDLs, VHDL and Verilog, the first half of the text prepares the reader for what follows in the second: the design of a MIPS Processor. By the end of D

  5. Broadband monitoring simulation with massively parallel processors

    Science.gov (United States)

    Trubetskov, Mikhail; Amotchkina, Tatiana; Tikhonravov, Alexander

    2011-09-01

    Modern efficient optimization techniques, namely needle optimization and gradual evolution, enable one to design optical coatings of any type. Moreover, these techniques allow obtaining multiple solutions with close spectral characteristics. It is important, therefore, to develop software tools that allow one to choose a practically optimal solution from a wide variety of possible theoretical designs. A practically optimal solution provides the highest production yield when the optical coating is manufactured. Computational manufacturing is a low-cost tool for choosing a practically optimal solution. The theory of probability predicts that reliable production yield estimations require many hundreds or even thousands of computational manufacturing experiments. As a result, reliable estimation of the production yield may require too much computational time. The most time-consuming operation is calculation of the discrepancy function used by a broadband monitoring algorithm. This function is formed by a sum of terms over a wavelength grid. These terms can be computed simultaneously in different threads of computation, which opens great opportunities for parallelization. Multi-core and multi-processor systems can provide accelerations up to several times. Additional potential for further acceleration is connected with using Graphics Processing Units (GPUs). A modern GPU consists of hundreds of massively parallel processors and is capable of performing floating-point operations efficiently.
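
    The decomposition described above (a discrepancy function formed as a sum of terms over a wavelength grid, with the terms computed in separate threads) can be sketched as follows. This is a minimal illustration, not the authors' code: the mean-squared form of the discrepancy, the toy `model_transmittance` function, and all parameter names are assumptions introduced here.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def model_transmittance(wavelength_nm, thickness_nm):
    # Hypothetical stand-in for a real thin-film model (assumption).
    return 0.5 + 0.5 * math.cos(4.0 * math.pi * thickness_nm / wavelength_nm)


def partial_sum(wavelengths, measured, thickness_nm):
    # Sum of squared residuals over one slice of the wavelength grid.
    return sum((model_transmittance(w, thickness_nm) - m) ** 2
               for w, m in zip(wavelengths, measured))


def discrepancy(wavelengths, measured, thickness_nm, workers=4):
    # Assumed form: mean squared difference between model and measurement.
    n = len(wavelengths)
    if n == 0:
        raise ValueError("wavelength grid must not be empty")
    # Split the grid into contiguous slices, at most one per worker.
    step = math.ceil(n / workers)
    slices = [(wavelengths[i:i + step], measured[i:i + step])
              for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda s: partial_sum(s[0], s[1], thickness_nm), slices)
        return sum(parts) / n
```

    In CPython, threads running pure Python arithmetic do not execute in parallel, so this sketch shows only how the sum splits into independent pieces; a real speed-up would need processes, native code, or a GPU kernel as the abstract suggests.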

  6. QPACE -- a QCD parallel computer based on Cell processors

    CERN Document Server

    Baier, H; Drochner, M; Eicker, N; Fischer, U; Fodor, Z; Frommer, A; Gomez, C; Goldrian, G; Heybrock, S; Hierl, D; Hüsken, M; Huth, T; Krill, B; Lauritsen, J; Lippert, T; Maurer, T; Meyer, N; Nobile, A; Ouda, I; Pivanti, M; Pleiter, D; Schäfer, A; Schick, H; Schifano, F; Simma, H; Solbrig, S; Streuer, T; Sulanke, K -H; Tripiccione, R; Vogt, J -S; Wettig, T; Winter, F

    2009-01-01

    QPACE is a novel parallel computer which has been developed to be primarily used for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the Playstation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.

  7. Programming massively parallel processors a hands-on approach

    CERN Document Server

    Kirk, David B

    2010-01-01

    Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVIDIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...

  8. ASIC Design of Floating-Point FFT Processor

    Institute of Scientific and Technical Information of China (English)

    陈禾; 赵忠武

    2004-01-01

    An application specific integrated circuit (ASIC) design of a 1024-point floating-point fast Fourier transform (FFT) processor is presented. It can satisfy the requirement for high accuracy FFT results in related fields. Several novel design techniques for the floating-point adder and multiplier are introduced in detail to enhance the speed of the system. At the same time, the power consumption is decreased. The hardware area is effectively reduced because an improved butterfly processor is developed. There is a substantial increase in the performance of the design since a pipelined architecture is adopted, and very large scale integration (VLSI) is easy to realize due to the regularity. A validation result using a field programmable gate array (FPGA) is shown at the end. When the system clock is set to 50 MHz, 204.8 μs is needed to complete the FFT computation.
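
    As a quick consistency check on the figures quoted above, the snippet below converts the reported 204.8 μs at 50 MHz into clock cycles and compares the result with the transform size. The function name is hypothetical, and the closing interpretation is an assumption, not something the abstract states.

```python
import math


def cycles_for(time_us, clock_mhz):
    # microseconds * cycles-per-microsecond = clock cycles
    return round(time_us * clock_mhz)


N_POINTS = 1024
total_cycles = cycles_for(204.8, 50.0)       # 10240 cycles
cycles_per_point = total_cycles / N_POINTS   # 10.0
stages = int(math.log2(N_POINTS))            # 10 radix-2 stages
print(total_cycles, cycles_per_point, stages)
```

    The total of 10240 cycles equals 1024 × log2(1024). One reading consistent with this, though not confirmed by the abstract, is that each of the ten radix-2 stages takes roughly one cycle per point.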

  9. JIST: Just-In-Time Scheduling Translation for Parallel Processors

    Directory of Open Access Journals (Sweden)

    Giovanni Agosta

    2005-01-01

    The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power consumption and reduced hardware cost. Dynamic compilation makes it possible to bridge the gap between the two technologies, but special attention must be paid to software instruction scheduling, a must for VLIW architectures. We have implemented JIST, a Virtual Machine and JIT compiler for Java Bytecode targeted to a VLIW processor. We show the impact of various optimizations on the performance of code compiled with JIST through an experimental study on a set of benchmark programs. We report significant speedups, and increases in the number of instructions issued per cycle of up to 50% with respect to the non-scheduling version of the JIT compiler. Further optimizations are discussed.

  10. Investigating the Performance of an Adiabatic Quantum Optimization Processor

    CERN Document Server

    Rose, Geordie; Dickson, Neil G; Hamze, Firas; Amin, M H S; Drew-Brook, Marshall; Chudak, Fabian A; Bunyk, Paul I; Macready, William G

    2010-01-01

    We calculate median adiabatic times (in seconds) of a specific superconducting adiabatic quantum processor for an NP-hard Ising spin glass instance class with up to N=128 binary variables. To do so, we ran high performance Quantum Monte Carlo simulations on a large-scale Internet-based computing platform. We compare the median adiabatic times with the median running times of two classical solvers and find that, for problems with up to 128 variables, the adiabatic times for the simulated processor architecture are about 4 and 6 orders of magnitude shorter than the two classical solvers' times. This performance difference shows that, even in the potential absence of a scaling advantage, adiabatic quantum optimization may outperform classical solvers.

  11. Image processing algorithm acceleration using reconfigurable macro processor model

    Institute of Scientific and Technical Information of China (English)

    孙广富; 陈华明; 卢焕章

    2004-01-01

    The concept and advantages of reconfigurable technology are introduced. A processor architecture based on the reconfigurable macro processor (RMP) model, built from an FPGA array and a DSP, is put forward and has been implemented. Two image algorithms are developed: template-based automatic target recognition and zone labeling. One estimates the motion direction in the infrared image background; the other is a line-extraction algorithm based on image zone labeling and a phase grouping technique. Each is a kind of "hardware" function that can be called by the DSP from a high-level algorithm, and also a kind of hardware algorithm of the DSP. The experimental results show that reconfigurable computing technology based on the RMP is an ideal means of accelerating high-speed image processing tasks. High real-time performance is obtained in our two applications on the RMP.

  12. In-Network Adaptation of Video Streams Using Network Processors

    Directory of Open Access Journals (Sweden)

    Mohammad Shorfuzzaman

    2009-01-01

    problem can be addressed, near the network edge, by applying dynamic, in-network adaptation (e.g., transcoding) of video streams to meet available connection bandwidth, machine characteristics, and client preferences. In this paper, we extrapolate from earlier work of Shorfuzzaman et al. (2006), in which we implemented and assessed an MPEG-1 transcoding system on the Intel IXP1200 network processor, to consider the feasibility of in-network transcoding for other video formats and network processor architectures. The use of “on-the-fly” video adaptation near the edge of the network offers the promise of simpler support for a wide range of end devices with different display and other characteristics that can be used in different types of environments.

  13. 40 CFR 791.45 - Processors.

    Science.gov (United States)

    2010-07-01

    ... 40 Protection of Environment 31 2010-07-01 2010-07-01 true Processors. 791.45 Section 791.45 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) TOXIC SUBSTANCES CONTROL ACT (CONTINUED) DATA REIMBURSEMENT Basis for Proposed Order § 791.45 Processors. (a) Generally, processors will be...

  14. FPGA Based High Speed SPA Resistant Elliptic Curve Scalar Multiplier Architecture

    Directory of Open Access Journals (Sweden)

    Khalid Javeed

    2016-01-01

    The higher computational complexity of an elliptic curve scalar point multiplication operation limits its implementation on general purpose processors. Dedicated hardware architectures are essential to reduce the computational time, which results in a substantial increase in the performance of associated cryptographic protocols. This paper presents a unified architecture to compute modular addition, subtraction, and multiplication operations over a finite field of large prime characteristic GF(p). Subsequently, dual instances of the unified architecture are utilized in the design of a high speed elliptic curve scalar multiplier architecture. The proposed architecture is synthesized and implemented on several different Xilinx FPGA platforms for different field sizes. The proposed design computes a 192-bit elliptic curve scalar multiplication in 2.3 ms on a Virtex-4 FPGA platform. It is 34% faster and requires 40% fewer clock cycles for elliptic curve scalar multiplication and consumes considerably fewer FPGA slices as compared to the other existing designs. The proposed design is also resistant to timing and simple power analysis (SPA) attacks; therefore it is a good choice in the construction of fast and secure elliptic curve based cryptographic protocols.
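
    To make the operations named above concrete, the sketch below builds elliptic curve scalar multiplication from modular addition, subtraction, and multiplication over GF(p). It is an illustration under stated assumptions, not the paper's design: the toy curve y^2 = x^3 + 2x + 2 over GF(17) and base point (5, 1) are chosen here for readability, and the Montgomery ladder is a well-known regular-pattern technique swapped in for illustration, since the abstract does not say how SPA resistance is achieved.

```python
P = 17        # toy prime (assumption; the paper targets e.g. 192-bit fields)
A, B = 2, 2   # toy curve y^2 = x^3 + A*x + B over GF(P) (assumption)
G = (5, 1)    # a point on the toy curve


def mod_add(a, b):
    return (a + b) % P


def mod_sub(a, b):
    return (a - b) % P


def mod_mul(a, b):
    return (a * b) % P


def mod_inv(a):
    # Fermat inverse, valid because P is prime and a != 0 mod P.
    return pow(a, P - 2, P)


def point_add(p1, p2):
    # Affine addition; None represents the point at infinity.
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and mod_add(y1, y2) == 0:
        return None
    if p1 == p2:
        num = mod_add(mod_mul(3, mod_mul(x1, x1)), A)
        lam = mod_mul(num, mod_inv(mod_mul(2, y1)))
    else:
        lam = mod_mul(mod_sub(y2, y1), mod_inv(mod_sub(x2, x1)))
    x3 = mod_sub(mod_sub(mod_mul(lam, lam), x1), x2)
    y3 = mod_sub(mod_mul(lam, mod_sub(x1, x3)), y1)
    return (x3, y3)


def scalar_mult(k, point, bits=8):
    # Montgomery ladder: one addition and one doubling per key bit,
    # whatever the bit value. Requires 0 <= k < 2**bits.
    r0, r1 = None, point
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = point_add(r0, r1), point_add(r1, r1)
        else:
            r0, r1 = point_add(r0, r0), point_add(r0, r1)
    return r0
```

    Python integers and branches are not constant-time, so this shows only the regular operation sequence that makes the ladder attractive against simple power analysis; it offers no actual side-channel protection.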

  15. An Optimised Distributed Arithmetic Architecture for 8×8 DTT

    Directory of Open Access Journals (Sweden)

    Ranjan K. Senapati

    2015-08-01

    Discrete Tchebichef Transform (DTT) is an orthogonal transform and is used in many applications like image and video compression, feature extraction, artefact analysis, blind integrity verification and pattern recognition. In comparison with DCT, DTT has better image reconstruction quality for certain classes of images. Direct implementation of DTT requires a large number of multiplications, which are time-consuming and expensive in a simple processor. To perform in real time, this large number of multiplications is completely avoided in our proposed architecture. The proposed architecture uses a distributed arithmetic (DA) based technique which offers high speed and small area. The basic architecture consists of a one dimensional (1D) row DTT followed by a transpose register array and another 1D column DTT. The 1D DTT structure only requires 15 adders to build a compressed adder matrix and is also ROM free. Compared with a DCT architecture, the proposed architecture shows an improvement in speed and a reduction in area by 5% on a Xilinx Virtex-4 FPGA platform.
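
    The row-column structure described above (a 1D row transform, a transpose, then a second 1D transform) can be sketched in software as follows. This is only a functional model under assumptions: the orthonormal basis is obtained here by Gram-Schmidt orthogonalization of centered monomials on the 8-point grid, so its sign convention may differ from the paper's, and plain multiplications are used where the hardware uses distributed arithmetic with adders.

```python
N = 8


def tchebichef_basis(n=N):
    # Orthonormal discrete polynomial basis on n uniformly spaced points,
    # built by Gram-Schmidt on centered monomials (assumed construction).
    center = (n - 1) / 2.0
    basis = []
    for degree in range(n):
        v = [(x - center) ** degree for x in range(n)]
        for b in basis:
            proj = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - proj * bi for vi, bi in zip(v, b)]
        norm = sum(vi * vi for vi in v) ** 0.5
        basis.append([vi / norm for vi in v])
    return basis


def dtt_1d(vec, basis):
    # One coefficient per basis polynomial.
    return [sum(b[x] * vec[x] for x in range(len(vec))) for b in basis]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dtt_2d(block, basis):
    rows = [dtt_1d(r, basis) for r in block]             # 1D row DTT
    cols = [dtt_1d(c, basis) for c in transpose(rows)]   # transpose, 1D DTT
    return transpose(cols)
```

    For a constant 8×8 block, only the coefficient at index [0][0] is non-zero, which is a convenient sanity check for the separable structure.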

  16. Raexplore: Enabling Rapid, Automated Architecture Exploration for Full Applications

    Energy Technology Data Exchange (ETDEWEB)

    Zhang, Yao [Argonne National Lab. (ANL), Argonne, IL (United States); Balaprakash, Prasanna [Argonne National Lab. (ANL), Argonne, IL (United States); Meng, Jiayuan [Argonne National Lab. (ANL), Argonne, IL (United States); Morozov, Vitali [Argonne National Lab. (ANL), Argonne, IL (United States); Parker, Scott [Argonne National Lab. (ANL), Argonne, IL (United States); Kumaran, Kalyan [Argonne National Lab. (ANL), Argonne, IL (United States)

    2014-12-01

    We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors, the IBM Blue Gene/Q compute chip and the Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.

  17. Highly scalable digital front end architectures for digital printing

    Science.gov (United States)

    Staas, David

    2011-01-01

    HP's digital printing presses consume a tremendous amount of data. The architectures of the Digital Front Ends (DFEs) that feed these large, very fast presses have evolved from basic, single-RIP (Raster Image Processor) systems to multirack, distributed systems that can take a PDF file and deliver data in excess of 3 Gigapixels per second to keep the presses printing at 2000+ pages per minute. This paper highlights some of the more interesting parallelism features of our DFE architectures. The high-performance architecture developed over the last 5+ years can scale up to HP's largest digital press, out to multiple mid-range presses, and down into a very low-cost single box deployment for low-end devices as appropriate. Principles of parallelism pervade every aspect of the architecture, from the lowest-level elements of jobs to parallel imaging pipelines that feed multiple presses. From cores to threads to arrays to network teams to distributed machines, we use a systematic approach to move bottlenecks. The ultimate goals of these efforts are: to take the best advantage of the prevailing hardware options at our disposal; to reduce power consumption and cooling requirements; and to ultimately reduce the cost of the solution to our customers.

  18. VLSI based FFT Processor with Improvement in Computation Speed and Area Reduction

    Directory of Open Access Journals (Sweden)

    M.Sheik Mohamed

    2013-06-01

    In this paper, a modular approach is presented to develop parallel pipelined architectures for the fast Fourier transform (FFT) processor. The new pipelined FFT architecture takes advantage of underutilized hardware, based on the complex conjugate of the final stage results, without increasing the hardware complexity. The operating frequency of the new architecture can be decreased, which in turn reduces the power consumption. A comparison of area and computing time is drawn between the new design and previous architectures. The new structure is synthesized using Xilinx ISE and simulated using ModelSim Starter Edition. The designed FFT algorithm is realized in our processor to reduce the number of complex computations.

  19. How the Navy Can Use Open Systems Architecture to Revolutionize Capability Acquisition: The Naval OSA Strategy Can Yield Multiple Benefits

    Science.gov (United States)

    2015-04-30

    ... that can be (1) defined and managed by government/industry consortia and...consolidation of platform-unique architectures into an open product-line architecture defined and managed by government/industry consortia that will...a holistic OSA approach are government/industry consortia, which help ...

  20. Exploring multiple feature combination strategies with a recurrent neural network architecture for off-line handwriting recognition

    Science.gov (United States)

    Mioulet, L.; Bideault, G.; Chatelain, C.; Paquet, T.; Brunessaux, S.

    2015-01-01

    The BLSTM-CTC is a novel recurrent neural network architecture that has outperformed previous state of the art algorithms in tasks such as speech recognition or handwriting recognition. It has the ability to process long term dependencies in temporal signals in order to label unsegmented data. This paper describes different ways of combining features using a BLSTM-CTC architecture. Not only do we explore the low level combination (feature space combination) but we also explore high level combination (decoding combination) and mid-level (internal system representation combination). The results are compared on the RIMES word database. Our results show that the low level combination works best, thanks to the powerful data modeling of the LSTM neurons.

  1. Real-time optical processor prototype for remote SAR applications

    Science.gov (United States)

    Marchese, Linda; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Bourqui, Pascal; Legros, Mathieu; Desnoyers, Nichola; Guillot, Ludovic; Mercier, Luc; Savard, Maxime; Martel, Anne; Châteauneuf, François; Bergeron, Alain

    2009-09-01

    A Compact Real-Time Optical SAR Processor has been successfully developed and tested. SAR, or Synthetic Aperture Radar, is a powerful tool providing enhanced day and night imaging capabilities. SAR systems typically generate large amounts of information generally in the form of complex data that are difficult to compress. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates this is a clear disadvantage. SAR images are typically processed electronically applying dedicated Fourier transformations. This, however, can also be performed optically in real-time. Indeed, the first SAR images have been optically processed. The optical processor architecture provides inherent parallel computing capabilities that can be used advantageously for the SAR data processing. Onboard SAR image generation would provide local access to processed information paving the way for real-time decision-making. This could eventually benefit navigation strategy and instrument orientation decisions. Moreover, for interplanetary missions, onboard analysis of images could provide important feature identification clues and could help select the appropriate images to be transmitted to Earth, consequently helping bandwidth management. This could ultimately reduce the data throughput requirements and related transmission bandwidth. This paper reviews the design of a compact optical SAR processor prototype that would reduce power, weight, and size requirements and reviews the analysis of SAR image generation using the table-top optical processor. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed. Results of image generation from simulated point targets as well as real satellite-acquired raw data are presented.

  2. Analysis of the computational requirements of a pulse-doppler radar signal processor

    CSIR Research Space (South Africa)

    Broich, R

    2012-05-01

    architectures [1]. These simplifications are often degrading to algorithmic performance and thus to the entire radar system. In this paper the different computational operations that are used in pulse-Doppler radar signal processing are explored, in order... Hz to 10 GHz... Fig. 1. Radar signal processor (RSP) flow of operations... purpose computer architectures [3]. An abstract machine, in which only memory reads, writes, additions and multiplications are considered to be significant operations...

  3. An Experimental Digital Image Processor

    Science.gov (United States)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-Hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
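
    The edge enhancement step named above, unsharp masking, adds back a scaled difference between the image and a blurred copy of it. The sketch below shows the generic technique on a single channel; the 3×3 box blur, the edge-replication border handling, and the `amount` parameter are assumptions made here, not details of the Kodak hardware.

```python
def box_blur(img):
    # 3x3 mean filter with edge replication (assumed blur kernel).
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def unsharp_mask(img, amount=1.0):
    # sharpened = original + amount * (original - blurred)
    blurred = box_blur(img)
    return [[p + amount * (p - b) for p, b in zip(row, brow)]
            for row, brow in zip(img, blurred)]
```

    Flat regions pass through unchanged because the blurred value equals the original there, while pixels on either side of an edge are pushed apart, which is what produces the sharpening effect. A real pipeline would also clip the result to the valid pixel range.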

  4. Challenges of Algebraic Multigrid across Multicore Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Baker, A H; Gamblin, T; Schulz, M; Yang, U M

    2010-04-12

    Algebraic multigrid (AMG) is a popular solver for large-scale scientific computing and an essential component of many simulation codes. AMG has been shown to be extremely efficient on distributed-memory architectures. However, when executed on modern multicore architectures, we face new challenges that can significantly deteriorate AMG's performance. We examine its performance and scalability on three disparate multicore architectures: a cluster with four AMD Opteron Quad-core processors per node (Hera), a Cray XT5 with two AMD Opteron Hex-core processors per node (Jaguar), and an IBM BlueGene/P system with a single Quad-core processor (Intrepid). We discuss our experiences on these platforms and present results using both an MPI-only and a hybrid MPI/OpenMP model. We also discuss a set of techniques that helped to overcome the associated problems, including thread and process pinning and correct memory associations.

  5. A wearable real-time image processor for a vision prosthesis.

    Science.gov (United States)

    Tsai, D; Morley, J W; Suaning, G J; Lovell, N H

    2009-09-01

    Rapid progress in recent years has made implantable retinal prostheses a promising therapeutic option in the near future for patients with macular degeneration or retinitis pigmentosa. Yet little work on devices that encode visual images into electrical stimuli has been reported to date. This paper presents a wearable image processor for use as the external module of a vision prosthesis. It is based on a dual-core microprocessor architecture and runs the Linux operating system. A set of image-processing algorithms executes on the digital signal processor of the device, which may be controlled remotely via a standard desktop computer. The results indicate that a highly flexible and configurable image processor can be built with the dual-core architecture. Depending on the image-processing requirements, general-purpose embedded microprocessors alone may be inadequate for implementing image-processing strategies required by retinal prostheses.

  6. FPGA Based Implementation of Pipelined 32-bit RISC Processor with Floating Point Unit

    Directory of Open Access Journals (Sweden)

    Jinde Vijay Kumar

    2014-04-01

    This paper presents a 32-bit RISC processor with a floating point unit designed using a pipelined architecture, which improves the speed of operation as well as overall performance. The processor is developed especially for arithmetic operations on both fixed and floating point numbers, and for branch and logical functions. The proposed architecture is able to prevent the pipeline from flushing when a branch instruction occurs and is able to provide halt support. Floating point operations are widely used these days in applications ranging from graphics to medical imaging. Thus, the processor can be used in diversified application areas. The necessary code is written in the hardware description language Verilog HDL. The Quartus II 10.1 suite is used for software development; ModelSim is used for simulation, followed by implementation on an Altera DE2 FPGA board.

  7. Low-Power and High Speed 128-Point Pipline FFT/IFFT Processor for OFDM Applications

    Directory of Open Access Journals (Sweden)

    D. Rajaveerappa

    2012-03-01

    This paper presents a low power and high speed 128-point pipelined Fast Fourier Transform (FFT) and inverse Fast Fourier Transform (IFFT) processor for OFDM. The modified architecture also provides a ROM module and variable length support from 128 to 2048 points for FFT/IFFT in OFDM applications such as digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T), asymmetric digital subscriber loop (ADSL), and very-high-speed digital subscriber loop (VDSL). The 128-point architecture consists of an optimized pipeline implementation based on a Radix-2 butterfly processing element. To reduce power consumption and chip area, special current-mode SRAMs are adopted to replace shift registers in the delay lines. In low-power operation, when the supply voltage is scaled down to 2.3 V, the processor consumes 176 mW when it runs at 17.8 MHz.
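
    The radix-2 butterfly at the heart of such a processor combines two half-length transforms with one twiddle-factor multiplication per output pair. The sketch below is a plain recursive software version of that idea, with the IFFT obtained from the same routine by conjugation; it says nothing about the pipelining, ROM module, or SRAM delay lines of the paper, and the function names are chosen here for illustration.

```python
import cmath


def fft_radix2(x):
    # Recursive decimation-in-time FFT; length must be a power of two.
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Radix-2 butterfly: one twiddle multiply, one add, one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_radix2(spectrum):
    # IFFT via the forward transform: conjugate, transform, conjugate, scale.
    n = len(spectrum)
    y = fft_radix2([complex(v).conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]
```

    Reusing the forward routine for the inverse mirrors the way a single FFT/IFFT datapath can serve both directions. The supported lengths named in the abstract (128 to 2048 points) are all powers of two, so they fit the length check above.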

  8. The UA1 trigger processor

    CERN Document Server

    Grayer, G H

    1981-01-01

    Experiment UA1 is a large multipurpose spectrometer at the CERN proton-antiproton collider. The principal trigger is formed on the basis of the energy deposition in calorimeters. A trigger decision taken in under 2.4 microseconds can avoid dead-time losses due to the bunched nature of the beam. To achieve this, fast 8-bit charge-to-digital converters have been built, followed by two identical digital processors tailored to the experiment. The outputs of groups of the 2440 photomultipliers in the calorimeters are summed to form a total of 288 input channels to the ADCs. A look-up table in RAM is used to convert the digitised photomultiplier signals to energy in one processor, and to transverse energy in the other. Each processor forms four sums from a chosen combination of input channels, and also counts the number of clusters with electromagnetic or hadronic energy above pre-determined levels. Up to twelve combinations of these conditions, together with external information, may be combined in coincidence or in...
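
    The look-up-table step described above can be modelled in a few lines: each 8-bit ADC code indexes a 256-entry table that already contains the calibrated energy, so no arithmetic is needed per event beyond the final sums. The linear calibration, the pedestal and gain values, and the use of E·sin(θ) for transverse energy are illustrative assumptions introduced here, not parameters from the UA1 paper.

```python
import math


def build_lut(gain_gev_per_count, pedestal, theta_rad=None):
    # 256-entry table for one 8-bit channel. With theta_rad given, entries
    # hold transverse energy E * sin(theta); otherwise plain energy.
    scale = 1.0 if theta_rad is None else math.sin(theta_rad)
    return [max(0.0, (code - pedestal) * gain_gev_per_count) * scale
            for code in range(256)]


def summed_energy(adc_codes, luts, channels):
    # One table look-up per selected channel, then a sum.
    return sum(luts[ch][adc_codes[ch]] for ch in channels)


def passes_threshold(adc_codes, luts, channels, threshold_gev):
    return summed_energy(adc_codes, luts, channels) > threshold_gev
```

    Precomputing the tables moves the calibration cost out of the trigger path, which is the usual reason for choosing a RAM look-up when a decision must be made within a fixed time budget.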

  9. A Real-Time High Performance Computation Architecture for Multiple Moving Target Tracking Based on Wide-Area Motion Imagery via Cloud and Graphic Processing Units

    Directory of Open Access Journals (Sweden)

    Kui Liu

    2017-02-01

    This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for the application of detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applied a front-end web based server to perform the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA©). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. The cloud and GPUs based computing provides an efficient real-time target recognition and tracking approach as compared to methods when the work flow is applied using only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that a GC-MTT based approach provides drastically improved tracking with low frame rates over realistic conditions.

  10. A Real-Time High Performance Computation Architecture for Multiple Moving Target Tracking Based on Wide-Area Motion Imagery via Cloud and Graphic Processing Units.

    Science.gov (United States)

    Liu, Kui; Wei, Sixiao; Chen, Zhijiang; Jia, Bin; Chen, Genshe; Ling, Haibin; Sheaff, Carolyn; Blasch, Erik

    2017-02-12

    This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for the application of detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applied a front-end web based server to perform the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA©). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. The cloud and GPUs based computing provides an efficient real-time target recognition and tracking approach as compared to methods when the work flow is applied using only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that a GC-MTT based approach provides drastically improved tracking with low frame rates over realistic conditions.

  11. A Real-Time High Performance Computation Architecture for Multiple Moving Target Tracking Based on Wide-Area Motion Imagery via Cloud and Graphic Processing Units

    Science.gov (United States)

    Liu, Kui; Wei, Sixiao; Chen, Zhijiang; Jia, Bin; Chen, Genshe; Ling, Haibin; Sheaff, Carolyn; Blasch, Erik

    2017-01-01

    This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for the application of detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applied a front-end web based server to perform the interaction with Hadoop and highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA©). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. The cloud and GPUs based computing provides an efficient real-time target recognition and tracking approach as compared to methods when the work flow is applied using only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that a GC-MTT based approach provides drastically improved tracking with low frame rates over realistic conditions. PMID:28208684

  12. Adaptive Optoelectronic Eyes: Hybrid Sensor/Processor Architectures

    Science.gov (United States)

    2006-11-13

    J.  Lange , C. von der Malsburg, R. P. Würtz, and W. Konen, “Distortion Invariant Object Recognition Adaptive Optoelectronic Eyes: Hybrid Sensor...Meeting, Dallas, Texas, (November, 1998). 17.  G. Sáry, G. Kovács, K. Köteles, G.  Benedek , J. Fiser, and I. Biederman, “Selectivity Variations in Monkey

  13. A system architecture, processor, and communication protocol for secure implants

    NARCIS (Netherlands)

    C. Strydis (Christos); R.M. Seepers (Robert); P. Peris-Lopez (Pedro); D. Siskos (Dimitrios); I. Sourdis (Ioannis)

    2013-01-01

    Secure and energy-efficient communication between Implantable Medical Devices (IMDs) and authorized external users is attracting increasing attention these days. However, there currently exists no systematic approach to the problem, while solutions from neighboring fields, such as wirele

  14. A system architecture, processor, and communication protocol for secure implants

    NARCIS (Netherlands)

    C. Strydis (Christos); R.M. Seepers (Robert); P. Peris-Lopez (Pedro); D. Siskos (Dimitrios); I. Sourdis (Ioannis)

    2013-01-01

    Secure and energy-efficient communication between Implantable Medical Devices (IMDs) and authorized external users is attracting increasing attention these days. However, there currently exists no systematic approach to the problem, while solutions from neighboring fields, such as

  15. SBNR (Signed Binary Number Representations) Digital Signal Processor Architecture.

    Science.gov (United States)

    1987-05-31

    from Controlling Office) 15. SECURITY CLASS. (of this report) Defense Contract Audit Agency UNCLASSIFIED Denver Branch Office 15a. DECLASSIFICATION...is shifted up and We contend that it holds promise of many new commercially written into the next higher row of PEs as each new video line available...presented at the Int. density (number of PEs per wafer) and in computational power, Conf. on Parallel Processing, 1985 we can also expect distribution of

  16. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

    Directory of Open Access Journals (Sweden)

    Anuj Sharma

    2015-01-01

    F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  17. Towards a Systematic Exploration of the Optimization Space for Many-Core Processors

    NARCIS (Netherlands)

    Fang, J.

    2014-01-01

    The architecture diversity of many-core processors - with their different types of cores, and memory hierarchies - makes the old model of reprogramming every application for every platform infeasible. Therefore, inter-platform portability has become a desirable feature of programming models. While

  18. Towards a Systematic Exploration of the Optimization Space for Many-Core Processors

    NARCIS (Netherlands)

    Fang, J.

    2014-01-01

    The architecture diversity of many-core processors - with their different types of cores, and memory hierarchies - makes the old model of reprogramming every application for every platform infeasible. Therefore, inter-platform portability has become a desirable feature of programming models. While f

  19. Research on Superscalar Digital Signal Processor

    Institute of Scientific and Technical Information of China (English)

    Deng Zhenghong; Zheng Wei; Deng Lei; Hu Zhengguo

    2004-01-01

    Under the direction of design space theory,in this paper we discuss the design of a superscalar pipelining using the way of multiple issues,and the implement of a superscalar-based RISC DSP architecture,SDSP.Furthermore,in this paper we discuss the validity of instruction prefetch,the branch prediction,the depth of instruction window and other issues that can affect the performance of superscalar DSP.

  20. Architecture and Application-Aware Management of Complexity of Mapping Multiplication to FPGA DSP Blocks in High Level Synthesis

    Directory of Open Access Journals (Sweden)

    Sharad Sinha

    2014-01-01

    Multiplication is a common operation in many applications and there exist various types of multiplication operations. Current high level synthesis (HLS) flows generally treat all multiplication operations as equal and indistinguishable from each other, leading to inefficient mapping to resources. This paper proposes algorithms for automatically identifying the different types of multiplication operations and investigates the ensemble of these different types of multiplication operations. This distinguishes it from previous works, where mapping strategies for an individual type of multiplication operation have been investigated and the type of multiplication operation is assumed to be known a priori. A new cost model, independent of device and synthesis tools, for establishing priority among different types of multiplication operations for mapping to on-chip DSP blocks is also proposed. This cost model is used by a proposed analysis and priority ordering based mapping strategy targeted at making efficient use of hard DSP blocks on FPGAs while maximizing the operating frequency of designs. Results show that the proposed methodology could result in designs which were at least 2× faster in performance than those generated by a commercial HLS tool, Vivado-HLS.

  1. Architectural Prototyping

    DEFF Research Database (Denmark)

    Bardram, Jakob; Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2004-01-01

    A major part of software architecture design is learning how specific architectural designs balance the concerns of stakeholders. We explore the notion of "architectural prototypes", correspondingly architectural prototyping, as a means of using executable prototypes to investigate stakeholders' concerns with respect to a system under development. An architectural prototype is primarily a learning and communication vehicle used to explore and experiment with alternative architectural styles, features, and patterns in order to balance different architectural qualities. The use of architectural...

  2. Distributed digital signal processors for multi-body structures

    Science.gov (United States)

    Lee, Gordon K.

    1990-01-01

    Several digital filter designs were investigated which may be used to process sensor data from large space structures and to design digital hardware to implement the distributed signal processing architecture. Several experimental test articles are available at NASA Langley Research Center to evaluate these designs. A summary of some of the digital filter designs is presented, an evaluation of their characteristics relative to control design is discussed, and candidate hardware microcontroller/microcomputer components are given. Future activities include software evaluation of the digital filter designs and actual hardware implementation of some of the signal processor algorithms on an experimental testbed at NASA Langley.

  3. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E

  4. Addressing Thermal and Performance Variability Issues in Dynamic Processors

    Energy Technology Data Exchange (ETDEWEB)

    Yoshii, Kazutomo [Argonne National Lab. (ANL), Argonne, IL (United States); Llopis, Pablo [Univ. Carlos III de Madrid (Spain); Zhang, Kaicheng [Northwestern Univ., Evanston, IL (United States); Luo, Yingyi [Northwestern Univ., Evanston, IL (United States); Ogrenci-Memik, Seda [Northwestern Univ., Evanston, IL (United States); Memik, Gokhan [Northwestern Univ., Evanston, IL (United States); Sankaran, Rajesh [Argonne National Lab. (ANL), Argonne, IL (United States); Beckman, Pete [Argonne National Lab. (ANL), Argonne, IL (United States)

    2017-03-01

    As CMOS scaling nears its end, parameter variations (process, temperature and voltage) are becoming a major concern. To overcome parameter variations and provide stability, modern processors are becoming dynamic, opportunistically adjusting voltage and frequency based on thermal and energy constraints, which negatively impacts traditional bulk-synchronous parallelism-minded hardware and software designs. As node-level architecture is growing in complexity, implementing variation control mechanisms only with hardware can be a challenging task. In this paper we investigate a software strategy to manage hardware-induced variations, leveraging low-level monitoring/controlling mechanisms.

  5. Hardware Implementation of a Genetic Algorithm Based Canonical Singed Digit Multiplierless Fast Fourier Transform Processor for Multiband Orthogonal Frequency Division Multiplexing Ultra Wideband Applications

    Directory of Open Access Journals (Sweden)

    Mahmud Benhamid

    2009-01-01

    Problem statement: Ultra Wide Band (UWB) technology has attracted many researchers' attention due to its advantages and its great potential for future applications. The physical layer standard of the Multi-band Orthogonal Frequency Division Multiplexing (MB-OFDM) UWB system is defined by ECMA International. In this standard, the data sampling rate from the analog-to-digital converter to the physical layer is up to 528 Msample s⁻¹. Therefore, it is a challenge to realize the physical layer, especially the components with high computational complexity, in Very Large Scale Integration (VLSI) implementation. The Fast Fourier Transform (FFT) block, which plays an important role in the MB-OFDM system, is one of these components. Furthermore, the execution time of this module is only 312.5 ns. Therefore, with the traditional approach, high power consumption and hardware cost of the processor would be needed to meet the strict specifications of the UWB system. The objective of this study was to design an Application Specific Integrated Circuit (ASIC) FFT processor for this system. The specification was defined from the system analysis and literature research. Approach: Based on the algorithm and architecture analysis, a novel Genetic Algorithm (GA) based Canonical Signed Digit (CSD) multiplierless 128-point FFT processor and its inverse (IFFT) for MB-OFDM UWB systems was proposed. The proposed pipelined architecture was based on the modified Radix-2² algorithm, which had the same number of multipliers as the conventional Radix-2². However, the multiplication complexity and the ROM needed for storing twiddle factor coefficients could be eliminated by replacing the conventional complex multipliers with newly proposed GA-optimized CSD constant multipliers. The design was coded in Verilog HDL and targeted the Xilinx Virtex-II FPGA series. It was fully implemented and tested on real hardware using a Virtex-II FG456 prototype board and logic analyzer
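
    The abstract above relies on Canonical Signed Digit (CSD) constant multipliers. As a rough illustration of the general idea only (not the authors' GA-optimized design), the sketch below recodes an integer constant into digits from {-1, 0, +1} with no two adjacent non-zero digits, then multiplies using shifts and adds. The function names, the 10-bit quantisation, and the example coefficient are assumptions made for this illustration.

```python
import math


def to_csd(n: int) -> list[int]:
    """Return a canonical signed digit recoding of n, least significant digit first.

    Every digit is -1, 0, or +1 and no two adjacent digits are both non-zero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # n % 4 == 1 -> emit +1; n % 4 == 3 -> emit -1, which clears the next bit.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


def csd_multiply(x: int, digits: list[int]) -> int:
    """Multiply x by the constant encoded in digits using shifts and add/subtract only."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


def nonzero_digits(digits: list[int]) -> int:
    """Count the adders/subtractors a shift-and-add multiplier would need (plus one)."""
    return sum(1 for d in digits if d != 0)


# Hypothetical coefficient: cos(pi/4) quantised to 10 fractional bits (an assumption).
coeff = round(math.cos(math.pi / 4) * (1 << 10))
csd = to_csd(coeff)
print("coefficient:", coeff)
print("CSD digits (LSB first):", csd)
print("non-zero CSD digits:", nonzero_digits(csd), "vs binary ones:", bin(coeff).count("1"))
```

    Fewer non-zero digits means fewer adders in a hardwired constant multiplier, which is why a fixed twiddle factor can be applied without a general multiplier or a coefficient ROM.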

  6. Hardware Architecture Study for NASA's Space Software Defined Radios

    Science.gov (United States)

    Reinhart, Richard C.; Scardelletti, Maximilian C.; Mortensen, Dale J.; Kacpura, Thomas J.; Andro, Monty; Smith, Carl; Liebetreu, John

    2008-01-01

    This study defines a hardware architecture approach for software defined radios to enable commonality among NASA space missions. The architecture accommodates a range of reconfigurable processing technologies including general purpose processors, digital signal processors, field programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs) in addition to flexible and tunable radio frequency (RF) front-ends to satisfy varying mission requirements. The hardware architecture consists of modules, radio functions, and interfaces. The modules are a logical division of common radio functions that comprise a typical communication radio. This paper describes the architecture details, module definitions, and the typical functions on each module as well as the module interfaces. Trade-offs between component-based, custom architecture and a functional-based, open architecture are described. The architecture does not specify the internal physical implementation within each module, nor does the architecture mandate the standards or ratings of the hardware used to construct the radios.

  7. Architecture on Architecture

    DEFF Research Database (Denmark)

    Olesen, Karen

    2016-01-01

    This paper will discuss the challenges faced by architectural education today. It takes as its starting point the double commitment of any school of architecture: on the one hand the task of preserving the particular knowledge that belongs to the discipline of architecture, and on the other hand...... the obligation to prepare students to perform in a profession that is largely defined by forces outside that discipline. It will be proposed that the autonomy of architecture can be understood as a unique kind of information: as architecture’s self-reliance or knowledge-about itself. A knowledge...... that is not scientific or academic but is more like a latent body of data that we find embedded in existing works of architecture. This information, it is argued, is not limited by the historical context of the work. It can be thought of as a virtual capacity – a reservoir of spatial configurations that can...

  8. Multipurpose silicon photonics signal processor core.

    Science.gov (United States)

    Pérez, Daniel; Gasulla, Ivana; Crudgington, Lee; Thomson, David J; Khokhar, Ali Z; Li, Ke; Cao, Wei; Mashanovich, Goran Z; Capmany, José

    2017-09-21

    Integrated photonics changes the scaling laws of information and communication systems offering architectural choices that combine photonics with electronics to optimize performance, power, footprint, and cost. Application-specific photonic integrated circuits, where particular circuits/chips are designed to optimally perform particular functionalities, require a considerable number of design and fabrication iterations leading to long development times. A different approach inspired by electronic Field Programmable Gate Arrays is the programmable photonic processor, where a common hardware implemented by a two-dimensional photonic waveguide mesh realizes different functionalities through programming. Here, we report the demonstration of such reconfigurable waveguide mesh in silicon. We demonstrate over 20 different functionalities with a simple seven hexagonal cell structure, which can be applied to different fields including communications, chemical and biomedical sensing, signal processing, multiprocessor networks, and quantum information systems. Our work is an important step toward this paradigm. Integrated optical circuits today are typically designed for a few special functionalities and require complex design and development procedures. Here, the authors demonstrate a reconfigurable but simple silicon waveguide mesh with different functionalities.

  9. Multiple 3d Approaches for the Architectural Study of the Medieval Abbey of Cormery in the Loire Valley

    Science.gov (United States)

    Pouyet, T.

    2017-02-01

    This paper focuses on the technical approaches used for a PhD thesis on the architecture and spatial organization of Benedictine abbeys in Touraine in the Middle Ages, in particular the abbey of Cormery in the heart of the Loire Valley. Monastic space is approached in a diachronic way, from the early Middle Ages to modern times, using multi-source data: architectural study, written sources, ancient maps, various iconographic documents… Many scales are used in the analysis, from the establishment of the abbeys in a territory to the scale of a building such as the tower-entrance of the church of Cormery. These methodological axes have been developed in the research unit CITERES for many years, and 3D technology is now used to go further in that field. The 3D recording of the buildings of the abbey of Cormery allows us to work at the scale of the monastery and to produce useful data such as sections or orthoimages of the ground and the wall faces, which are afterwards drawn and analysed. The study of these documents, crossed with the other historical sources, allowed us to emphasize the presence of walls older than previously thought and to discover construction elements that had not been recognized earlier and which enhance the debate about the construction date of the St Paul tower and the associated monastic church.

  10. Multicore technology architecture, reconfiguration, and modeling

    CERN Document Server

    Qadri, Muhammad Yasir

    2013-01-01

    The saturation of design complexity and clock frequencies for single-core processors has resulted in the emergence of multicore architectures as an alternative design paradigm. Nowadays, multicore/multithreaded computing systems are not only a de-facto standard for high-end applications, they are also gaining popularity in the field of embedded computing. The start of the multicore era has altered the concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debu

  11. Efficient matrix inversion based on VLIW architecture

    Institute of Scientific and Technical Information of China (English)

    Li Zhang; Fu Li; Guangming Shi

    2014-01-01

    Matrix inversion is a critical part of communication, signal processing and electromagnetic systems. A flexible and scalable very long instruction word (VLIW) processor with clustered architecture is proposed for matrix inversion. A global register file (RF) is used to connect all the clusters. Two nearby clusters share a local register file. The instruction sets are also designed for the VLIW processor. Experimental results show that the proposed VLIW architecture takes only 45 latency to invert a 4 × 4 matrix when running at 150 MHz. The proposed design is roughly five times faster than the DSP solution in processing speed.
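
    The abstract does not say which inversion algorithm the processor runs. As a reference point only, the sketch below inverts a small matrix by Gauss-Jordan elimination using exact rational arithmetic; the choice of algorithm, the function name `invert`, and the test matrix are all assumptions for illustration.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with exact fractions."""
    n = len(matrix)
    # Build the augmented matrix [A | I].
    aug = [
        [Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the diagonal entry becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


# Hypothetical 4 x 4 example matrix.
a = [[4, 1, 0, 0], [1, 4, 1, 0], [0, 1, 4, 1], [0, 0, 1, 4]]
for row in invert(a):
    print([str(v) for v in row])
```

    The row updates inside each elimination step are independent of one another, which is the kind of instruction-level parallelism a clustered VLIW design could exploit.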

  12. Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

    Science.gov (United States)

    Hadade, Ioan; di Mare, Luca

    2016-08-01

    Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
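
    The Roofline model mentioned in this abstract bounds attainable performance by the lesser of peak compute throughput and memory bandwidth times arithmetic intensity. A minimal sketch follows; the hardware figures and kernel intensities are placeholder values chosen for illustration, not measurements from the paper.

```python
def roofline(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def limiting_factor(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> str:
    """Say whether the kernel sits left (memory) or right (compute) of the ridge point."""
    ridge = peak_gflops / bandwidth_gbs
    return "memory-bound" if intensity < ridge else "compute-bound"


# Placeholder machine: 500 GFLOP/s peak and 100 GB/s memory bandwidth (assumed values).
PEAK, BANDWIDTH = 500.0, 100.0
for name, ai in [("low-intensity kernel", 0.5), ("high-intensity kernel", 8.0)]:
    print(name, roofline(PEAK, BANDWIDTH, ai), limiting_factor(PEAK, BANDWIDTH, ai))
```

    Comparing a measured rate against this bound indicates how much headroom remains for a kernel on a given device.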

  13. Software architecture evolution

    DEFF Research Database (Denmark)

    Barais, Olivier; Le Meur, Anne-Francoise; Duchien, Laurence

    2008-01-01

    Software architectures must frequently evolve to cope with changing requirements, and this evolution often implies integrating new concerns. Unfortunately, when the new concerns are crosscutting, existing architecture description languages provide little or no support for this kind of evolution....... The software architect must modify multiple elements of the architecture manually, which risks introducing inconsistencies. This chapter provides an overview, comparison and detailed treatment of the various state-of-the-art approaches to describing and evolving software architectures. Furthermore, we discuss...... one particular framework named TranSAT, which addresses the above problems of software architecture evolution. TranSAT provides a new element in the software architecture description language, called an architectural aspect, for describing new concerns and their integration into an existing...

  14. Architectural Exploration of MPSoC Designs Based on an FPGA Emulation Framework

    OpenAIRE

    Valle, Del; Pablo, G.; Atienza, David; Magan, Ivan; Flores, Javier G.; Perez, Esther A.; Mendias, Jose M.; Benini, Luca; De Micheli, Giovanni

    2006-01-01

    With the growing complexity in consumer embedded products and the improvements in process technology, Multi-Processor System-On-Chip (MPSoC) architectures have become widespread. These new systems are very complex to design as they must execute multiple complex real-time applications (e.g. video processing, or videogames), while meeting several additional design constraints (e.g. energy consumption or time-to-market). Thus, in order to explore all the possible HW-SW configurations in a MPSoC,...

  15. Single-Scale Retinex Using Digital Signal Processors

    Science.gov (United States)

    Hines, Glenn; Rahman, Zia-Ur; Jobson, Daniel; Woodell, Glenn

    2005-01-01

    The Retinex is an image enhancement algorithm that improves the brightness, contrast and sharpness of an image. It performs a non-linear spatial/spectral transform that provides simultaneous dynamic range compression and color constancy. It has been used for a wide variety of applications ranging from aviation safety to general purpose photography. Many potential applications require the use of Retinex processing at video frame rates. This is difficult to achieve with general purpose processors because the algorithm contains a large number of complex computations and data transfers. In addition, many of these applications also constrain the potential architectures to embedded processors to save power, weight and cost. Thus we have focused on digital signal processors (DSPs) and field programmable gate arrays (FPGAs) as potential solutions for real-time Retinex processing. In previous efforts we attained a 21 (full) frame per second (fps) processing rate for the single-scale monochromatic Retinex with a TMS320C6711 DSP operating at 150 MHz. This was achieved after several significant code improvements and optimizations. Since then we have migrated our design to the slightly more powerful TMS320C6713 DSP and the fixed point TMS320DM642 DSP. In this paper we briefly discuss the Retinex algorithm, the performance of the algorithm executing on the TMS320C6713 and the TMS320DM642, and compare the results with the TMS320C6711.
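
    The single-scale Retinex is commonly written as the log of each pixel minus the log of a Gaussian-blurred surround at the same location. The sketch below is a plain-Python illustration of that form on a small grayscale image, not the optimized DSP code the abstract describes; the `+ 1.0` offset (to avoid log of zero), the border clamping, and the default `sigma` are assumptions.

```python
import math


def gaussian_kernel(sigma: float, radius: int) -> list[float]:
    """Normalised 1-D Gaussian weights covering [-radius, radius]."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the image border."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), n - 1)
                acc += w * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def single_scale_retinex(image, sigma: float = 2.0):
    """Return log(I + 1) - log(surround + 1), with a separable Gaussian surround."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    # Separable blur: filter along rows, then along columns via a transpose.
    surround = transpose(blur_rows(transpose(blur_rows(image, kernel)), kernel))
    return [
        [math.log(p + 1.0) - math.log(s + 1.0) for p, s in zip(row, srow)]
        for row, srow in zip(image, surround)
    ]


# Hypothetical 5 x 5 test image: dark background with one bright pixel in the centre.
img = [[10.0] * 5 for _ in range(5)]
img[2][2] = 100.0
for row in single_scale_retinex(img):
    print([round(v, 3) for v in row])
```

    Each output pixel needs a full neighbourhood of multiply-accumulates plus two logarithms, which gives a sense of why video-rate processing is demanding on a general purpose processor.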

  16. Design and implementation of a high performance network security processor

    Science.gov (United States)

    Wang, Haixin; Bai, Guoqiang; Chen, Hongyi

    2010-03-01

    The last few years have seen much significant progress in the field of application-specific processors. One example is network security processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation intensive burdens from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogeneous parallel crypto engine arrays lead to a Gbps rate NSP, which is programmable with domain specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000 based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.
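
    A quick way to read the reported figures is to divide throughput by clock rate, which gives the average number of bits processed per clock cycle across the whole design. The helper below is a hypothetical illustration of that arithmetic using only the two numbers quoted in the abstract.

```python
def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Average number of bits processed per clock cycle."""
    return throughput_bps / clock_hz


# Figures quoted in the abstract: 2.85 Gbps peak IPSec ESP throughput at 95 MHz.
ratio = bits_per_cycle(2.85e9, 95e6)
print(f"about {ratio:.1f} bits per clock cycle")
```

    The result is roughly 30 bits per cycle, a rate that is consistent with the abstract's emphasis on several crypto engines working in parallel rather than a single serial datapath.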

  17. Evaluation of the Intel Westmere-EP server processor

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A; CERN. Geneva. IT Department

    2010-01-01

    In this paper we report on a set of benchmark results recently obtained by CERN openlab when comparing the 6-core “Westmere-EP” processor with Intel’s previous generation of the same microarchitecture, the “Nehalem-EP”. The former is produced in a new 32nm process, the latter in 45nm. Both platforms are dual-socket servers. Multiple benchmarks were used to get a good understanding of the performance of the new processor. We used both industry-standard benchmarks, such as SPEC2006, and specific High Energy Physics benchmarks, representing both simulation of physics detectors and data analysis of physics events. Before summarizing the results we must stress the fact that benchmarking of modern processors is a very complex affair. One has to control (at least) the following features: processor frequency, overclocking via Turbo mode, the number of physical cores in use, the use of logical cores via Simultaneous Multi-Threading (SMT), the cache sizes available, the memory configuration installed, as well...

  18. Evaluation of the Intel Nehalem-EX server processor

    CERN Document Server

    Jarp, S; Leduc, J; Nowak, A; CERN. Geneva. IT Department

    2010-01-01

    In this paper we report on a set of benchmark results recently obtained by the CERN openlab by comparing the 4-socket, 32-core Intel Xeon X7560 server with the previous generation 4-socket server, based on the Xeon X7460 processor. The Xeon X7560 processor represents a major change in many respects, especially the memory sub-system, so it was important to make multiple comparisons. In most benchmarks the two 4-socket servers were compared. It should be underlined that both servers represent the “top of the line” in terms of frequency. However, in some cases, it was important to compare systems that integrated the latest processor features, such as QPI links, Symmetric multithreading and over-clocking via Turbo mode, and in such situations the X7560 server was compared to a dual socket L5520 based system with an identical frequency of 2.26 GHz. Before summarizing the results we must stress the fact that benchmarking of modern processors is a very complex affair. One has to control (at least) the following ...

  19. The breaking point of modern processor and platform technology

    CERN Document Server

    Nowak, A; Lazzaro, A; Leduc, J

    2011-01-01

    This work is an overview of state of the art processors used in High Energy Physics, their architecture and an extensive outline of the forthcoming technologies. Silicon process science and hardware design are making constant and rapid progress, and a solid grasp of these developments is imperative to the understanding of their possible future applications, which might include software strategy, optimizations, computing center operations and hardware acquisitions. In particular, the current issue of software and platform scalability is becoming more and more noticeable, and will develop in the near future with the growing core count of single chips and the approach of certain x86 architectural limits. Other topics brought forward include the hard, physical limits of innovation, the applicability of tried and tested computing formulas to modern technologies, as well as an analysis of viable alternate choices for continued development.

  20. The breaking point of modern processor and platform technology

    Science.gov (United States)

    Jarp, Sverre; Lazzaro, Alfio; Leduc, Julien; Nowak, Andrzej

    2011-12-01

    This work is an overview of state of the art processors used in High Energy Physics, their architecture and an extensive outline of the forthcoming technologies. Silicon process science and hardware design are making constant and rapid progress, and a solid grasp of these developments is imperative to the understanding of their possible future applications, which might include software strategy, optimizations, computing center operations and hardware acquisitions. In particular, the current issue of software and platform scalability is becoming more and more noticeable, and will develop in the near future with the growing core count of single chips and the approach of certain x86 architectural limits. Other topics brought forward include the hard, physical limits of innovation, the applicability of tried and tested computing formulas to modern technologies, as well as an analysis of viable alternate choices for continued development.

  1. Reducing adaptive optics latency using Xeon Phi many-core processors

    Science.gov (United States)

    Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah

    2015-11-01

    The next generation of Extremely Large Telescopes (ELTs) for astronomy will rely heavily on the performance of their adaptive optics (AO) systems. Real-time control is at the heart of the critical technologies that will enable telescopes to deliver the best possible science and will require a very significant extrapolation from current AO hardware existing for 4-10 m telescopes. Investigating novel real-time computing architectures and testing their eligibility against anticipated challenges is one of the main priorities of technology development for the ELTs. This paper investigates the suitability of the Intel Xeon Phi, which is a commercial off-the-shelf hardware accelerator. We focus on wavefront reconstruction performance, implementing a straightforward matrix-vector multiplication (MVM) algorithm. We present benchmarking results of the Xeon Phi on a real-time Linux platform, both as a standalone processor and integrated into an existing real-time controller (RTC). Performance of single and multiple Xeon Phis are investigated. We show that this technology has the potential of greatly reducing the mean latency and variations in execution time (jitter) of large AO systems. We present both a detailed performance analysis of the Xeon Phi for a typical E-ELT first-light instrument along with a more general approach that enables us to extend to any AO system size. We show that systematic and detailed performance analysis is an essential part of testing novel real-time control hardware to guarantee optimal science results.
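
    The wavefront reconstruction step described here is a matrix-vector multiplication (MVM), and the abstract reports both mean latency and jitter. The sketch below shows the shape of such a measurement in plain Python; using the population standard deviation as the jitter metric, the problem size, and the iteration count are assumptions, and the timings it prints say nothing about Xeon Phi performance.

```python
import statistics
import time


def mvm(matrix, vector):
    """Multiply a matrix (list of rows) by a vector: one dot product per output element."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def measure_latency(matrix, vector, iterations: int = 200):
    """Return (mean, jitter) of the MVM execution time in seconds.

    Jitter is taken here as the population standard deviation of the samples.
    """
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        mvm(matrix, vector)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.pstdev(samples)


# Hypothetical problem size: 64 actuator commands from 128 slope measurements.
control_matrix = [[(r + c) % 7 * 0.1 for c in range(128)] for r in range(64)]
slopes = [0.01 * c for c in range(128)]
mean_s, jitter_s = measure_latency(control_matrix, slopes)
print(f"mean {mean_s * 1e6:.1f} us, jitter {jitter_s * 1e6:.1f} us")
```

    Because every output element is an independent dot product, the rows can be distributed across cores, which is what makes the MVM a natural fit for many-core hardware.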

  2. Architecture on Architecture

    DEFF Research Database (Denmark)

    Olesen, Karen

    2016-01-01

    ...the obligation to prepare students to perform in a profession that is largely defined by forces outside that discipline. It will be proposed that the autonomy of architecture can be understood as a unique kind of information: as architecture’s self-reliance or knowledge-about itself. A knowledge that is not scientific or academic but is more like a latent body of data that we find embedded in existing works of architecture. This information, it is argued, is not limited by the historical context of the work. It can be thought of as a virtual capacity – a reservoir of spatial configurations that can be transformed and reapplied endlessly through its confrontation with shifting information from outside the realms of architecture. A selection of architects’ statements on their own work will be used to demonstrate how in quite diverse contemporary practices the re-use of existing architectures is applied...

  3. Digital Signal Processor For GPS Receivers

    Science.gov (United States)

    Thomas, J. B.; Meehan, T. K.; Srinivasan, J. M.

    1989-01-01

    Three innovative components combined to produce all-digital signal processor with superior characteristics: outstanding accuracy, high-dynamics tracking, versatile integration times, lower loss-of-lock signal strengths, and infrequent cycle slips. Three components are digital chip advancer, digital carrier downconverter and code correlator, and digital tracking processor. All-digital signal processor intended for use in receivers of Global Positioning System (GPS) for geodesy, geodynamics, high-dynamics tracking, and ionospheric calibration.

  4. Architecture Design of a Variable Length Instruction Set VLIW DSP

    Institute of Scientific and Technical Information of China (English)

    SHEN Zheng; HE Hu; YANG Xu; JIA Di; SUN Yihe

    2009-01-01

    The cost of the central register file and the size of the program code limit the scalability of very long instruction word (VLIW) processors with increasing numbers of functional units. This paper presents the architectural design of a six-way VLIW digital signal processor (DSP) with clustered register files. The architecture uses a variable length instruction set and supports dynamic instruction dispatching. The one-level memory system architecture of the processor includes 16-KB instruction and data caches and 16-KB instruction and data on-chip RAM. A compiler based on the Open64 was developed for the system. Evaluations show that the processor is suitable for high performance applications with a high code density and small program code size.

  5. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    Energy Technology Data Exchange (ETDEWEB)

    Cancelo, G. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Gottschalk, Erik Edward [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Pavlicek, V. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wang, M. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wu, J. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  6. Power estimation on functional level for programmable processors

    Directory of Open Access Journals (Sweden)

    M. Schneider

    2004-01-01

    In this contribution, different approaches to power estimation for programmable processors are presented and evaluated with respect to their applicability to modern digital signal processor architectures such as Very Long Instruction Word (VLIW) architectures. Special emphasis is placed on the concept of so-called Functional-Level Power Analysis (FLPA). This approach is based on partitioning the processor architecture into functional blocks such as the processing unit, clock network, internal memory and others. The power consumption of these blocks is described by parameter-dependent arithmetic model functions. A parser-based automated analysis of the assembler code of the system to be estimated yields the input parameters, for example the achieved degree of parallelism or the kind of memory access. The approach is evaluated on two modern digital signal processors using a variety of basic digital signal processing algorithms. The estimates obtained for the individual algorithms are compared with physically measured values, giving a very small maximum estimation error of 3%.

  7. The case for a generic implant processor.

    Science.gov (United States)

    Strydis, Christos; Gaydadjiev, Georgi N

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably selected processor simulator, various operational aspects of the application are being monitored. Findings on performance, cache behavior, branch prediction, power consumption, energy expenditure and instruction mixes are presented and analyzed. The suitability of such an implant processor and directions for future work are given.

  8. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wheeler, Ray; Lunn, Griffin; Jackson, Andrew

    2012-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon of less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  9. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie L.; Adam, Niklas M.; Barta, Daniel; Meyer, Caitlin E.; Pensinger, Stuart; Vega, Leticia M.; Callahan, Michael R.; Flynn, Michael; Wheeler, Ray

    2013-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon of less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  10. Research and Evaluating of a Multiple-dimension Scalable Stream Architecture

    Institute of Scientific and Technical Information of China (English)

    吴伟; 张春元; 文梅; 伍楠; 何义; 杨乾明; 管茂林; 荀长庆; 任巨; 柴俊

    2008-01-01

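    MASA (Multiple-dimension scalable Stream Architecture) is a stream architecture that can be scaled along multiple dimensions. This paper examines the scalability of the architecture in depth, analyzes the VLSI resource cost of intra-cluster, inter-cluster and multi-core scaling, and evaluates the performance of MASA with a set of test programs. The results show that the three scaling dimensions complement one another, allowing the MASA stream architecture to scale to thousands of ALUs integrated on a single chip.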

  11. Allocating application to group of consecutive processors in fault-tolerant deadlock-free routing path defined by routers obeying same rules for path selection

    Science.gov (United States)

    Leung, Vitus J.; Phillips, Cynthia A.; Bender, Michael A.; Bunde, David P.

    2009-07-21

    In a multiple processor computing apparatus, directional routing restrictions and a logical channel construct permit fault tolerant, deadlock-free routing. Processor allocation can be performed by creating a linear ordering of the processors based on routing rules used for routing communications between the processors. The linear ordering can assume a loop configuration, and bin-packing is applied to this loop configuration. The interconnection of the processors can be conceptualized as a generally rectangular 3-dimensional grid, and the MC allocation algorithm is applied with respect to the 3-dimensional grid.
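
    The idea of allocating a job to consecutive processors along a linear ordering that closes into a loop can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which bin-packing heuristic is applied, so first-fit is used here as one common choice, and the function name and the free/busy list are hypothetical.

```python
def first_fit_on_ring(free, size):
    """Return the ring positions of the first run of `size` consecutive free
    processors, treating the linear ordering as a loop; None if no run fits."""
    n = len(free)
    if size <= 0 or size > n:
        return None
    for start in range(n):
        # Positions wrap around the end of the ordering back to position 0.
        run = [(start + k) % n for k in range(size)]
        if all(free[p] for p in run):
            return run
    return None


# Hypothetical machine state: True marks a free processor in ring order.
free = [True, False, False, True, True, False, True, True]
job = first_fit_on_ring(free, 3)
print(job)
```

    Because the ordering is a loop, the run found here wraps from the last positions back to the first one, a placement that a purely linear scan would miss.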

  12. Low Power Complex Multiplier based FFT Processor

    Directory of Open Access Journals (Sweden)

    V.Sarada

    2015-08-01

    High speed processing of signals has led to the requirement for very high speed conversion of signals from the time domain to the frequency domain. In recent years there has been increasing demand for low power designs in the field of digital signal processing. Power consumption is the most important aspect when considering system performance. Designing and realizing a high performance Fast Fourier Transform (FFT) requires an efficient internal structure. In this paper we present an FFT Single Path Delay Feedback (SDF) pipeline architecture using the radix-2^4 algorithm. The complex multiplier is realized with a multiplierless architecture based on the Digit Slicing Concept. The radix-2^4 algorithm is used to reduce computational complexity. The proposed design has been coded in Verilog HDL and synthesized with the Cadence tool. The results demonstrate that power is reduced compared with complex multiplication using a CSD (Canonic Signed Digit) multiplier.
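
    For readers unfamiliar with the Canonic Signed Digit (CSD) representation that this abstract uses as its comparison baseline, the sketch below recodes a constant into digits from {-1, 0, +1} with no two adjacent non-zero digits, then multiplies by shifts and adds only. It illustrates CSD in general, not the digit-slicing multiplier proposed in the paper; the function names are hypothetical.

```python
def to_csd(n):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def csd_multiply(x, constant):
    """Multiply x by a non-negative constant using only shifts, adds and subtracts."""
    total = 0
    for shift, d in enumerate(to_csd(constant)):
        if d:
            total += d * (x << shift)
    return total


print(to_csd(7))            # 7 = 8 - 1, so only two non-zero digits
print(csd_multiply(13, 7))
```

    Fewer non-zero digits mean fewer adders in a hardware constant multiplier, which is the usual motivation for CSD coefficients such as FFT twiddle factors.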

  13. Optimal processor allocation for sort-last compositing under BSP-tree ordering

    Science.gov (United States)

    Ramakrishnan, C. R.; Silva, Claudio T.

    1999-03-01

    In this paper, we consider a parallel rendering model that exploits the fundamental distinction between rendering and compositing operations, by assigning processors from specialized pools for each of these operations. Our motivation is to support the parallelization of general scan-line rendering algorithms with minimal effort, basically by supporting a compositing back-end (i.e., a sort-last architecture) that is able to perform user-controlled image composition. Our computational model is based on organizing rendering as well as compositing processors on a BSP-tree, whose internal nodes we call the compositing tree. Many known rendering algorithms, such as volumetric ray casting and polygon rendering can be easily parallelized based on the structure of the BSP-tree. In such a framework, it is paramount to minimize the processing power devoted to compositing, by minimizing the number of processors allocated for composition as well as optimizing the individual compositing operations. In this paper, we address the problems related to the static allocation of processor resources to the compositing tree. In particular, we present an optimal algorithm to allocate compositing operations to compositing processors. We also present techniques to evaluate the compositing operations within each processor using minimum memory while promoting concurrency between computation and communication. We describe the implementation details and provide experimental evidence of the validity of our techniques in practice.

  14. Making Effective Decisions in Computer Architects' Real-World: Lessons and Experiences with Godson-2 Processor Designs

    Institute of Scientific and Technical Information of China (English)

    Wei-Wu Hu; Jian Wang

    2008-01-01

    Although the design of many kinds of microprocessors has been under development for several decades, the computer architecture R&D community lacks well documented lessons and experiences about design decisions in the research literature. In this paper, we systematically present the design decisions we made during the designing and prototyping of Godson-2 series processors. The 250MHz Godson-2B, 450MHz Godson-2C, and 1GHz Godson-2E processors that implement 64-bit, four-issue, out-of-order architecture were taped out in 2003, 2004, and 2005, respectively. Each processor triples its predecessor in the SPEC CPU2000 rates. Our first-hand experiences and lessons gained from these designs would provide unique perspectives and insights that are not available in any existing text books and/or published papers. We summarize 10 critical lessons and experiences based on hundreds of our attempts at architectural and design optimizations for performance improvement of Godson-2 series processors. The issues include silicon-simulation correlation, design balancing, performance optimizing, and pico-architecture tuning. We conclude that persistent improvement, attitude towards work-on-silicon design, and insightful understanding of software and fabrication process are the three most important factors for designing a high performance processor with low energy consumption.

  15. Rapid VLIW Processor Customization for Signal Processing Applications Using Combinational Hardware Functions

    Directory of Open Access Journals (Sweden)

    Hoare Raymond R

    2006-01-01

    This paper presents an architecture that combines VLIW (very long instruction word) processing with the capability to introduce application-specific customized instructions and highly parallel combinational hardware functions for the acceleration of signal processing applications. To support this architecture, a compilation and design automation flow is described for algorithms written in C. The key contributions of this paper are as follows: (1) a 4-way VLIW processor implemented in an FPGA, (2) large speedups through hardware functions, (3) a hardware/software interface with zero overhead, (4) a design methodology for implementing signal processing applications on this architecture, (5) tractable design automation techniques for extracting and synthesizing hardware functions. Several design tradeoffs for the architecture were examined including the number of VLIW functional units and register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations. Using the MediaBench benchmark suite, we tested our methodology and architecture to accelerate software. Our combined VLIW processor with hardware functions was compared to that of software executing on a RISC processor, specifically the soft core embedded NIOS II processor. For software kernels converted into hardware functions, we show a hardware performance multiplier of up to times that of software with an average times faster. For the entire application in which only a portion of the software is converted to hardware, the performance improvement is as much as 30X faster than the nonaccelerated application, with a 12X improvement on average.

  16. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, R; Beretta, M; Cipriani, R; Citraro, S; Citterio, M; Colombo, A; Crescioli, F; Dimas, D; Donati, S; Giannetti, P; Kordas, K; Lanza, A; Liberali, V; Luciano, P; Magalotti, D; Neroutsos, P; Nikolaidis, S; Piendibene, M; Sakellariou, A; Shojaii, S; Sotiropoulou, C-L; Stabile, A

    2014-01-01

    The Associative Memory (AM) system of the FTK processor has been designed to perform pattern matching using the hit information of the ATLAS silicon tracker. The AM is the heart of the FTK and it finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside the FTK, multiple designs and tests have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of the AM chip, an ASIC designed and optimized to perform pattern matching, and two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. Special relevance will be given to the AMchip design that includes two custom cells optimized for low consumption. We repo...
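
    The pattern-matching step described in this abstract can be pictured with a small software sketch: a bank of precomputed low-resolution patterns, one coarse address per detector layer, is compared against the hits of an event, and the indices of fully matched patterns are returned as track candidates. This is a simplified model under stated assumptions; the data layout, names and values are hypothetical, the real AM chip compares all patterns in parallel in hardware, and any tolerance for missing layers is not modelled here.

```python
def match_patterns(pattern_bank, hits_by_layer):
    """Return indices of patterns whose address on every layer was hit."""
    hit_sets = [set(hits) for hits in hits_by_layer]
    return [
        index
        for index, pattern in enumerate(pattern_bank)
        if all(addr in hit_sets[layer] for layer, addr in enumerate(pattern))
    ]


# Hypothetical 3-layer detector: each pattern is one coarse address per layer.
pattern_bank = [(1, 4, 7), (2, 4, 9), (1, 5, 7)]
hits_by_layer = [[1, 2], [4], [7, 8]]
candidates = match_patterns(pattern_bank, hits_by_layer)
print(candidates)
```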

  17. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  18. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed on purpose to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  19. Sex without sex chromosomes: genetic architecture of multiple loci independently segregating to determine sex ratios in the copepod Tigriopus californicus.

    Science.gov (United States)

    Alexander, H J; Richardson, J M L; Edmands, S; Anholt, B R

    2015-12-01

    Sex-determining systems are remarkably diverse and may evolve rapidly. Polygenic sex-determination systems are predicted to be transient and evolutionarily unstable, yet examples have been reported across a range of taxa. Here, we provide the first direct evidence of polygenic sex determination in Tigriopus californicus, a harpacticoid copepod with no heteromorphic sex chromosomes. Using genetically distinct inbred lines selected for male- and female-biased clutches, we generated a genetic map with 39 SNPs across 12 chromosomes. Quantitative trait locus mapping of sex ratio phenotype (the proportion of male offspring produced by an F2 female) in four F2 families revealed six independently segregating quantitative trait loci on five separate chromosomes, explaining 19% of the variation in sex ratios. The sex ratio phenotype varied among loci across chromosomes in both direction and magnitude, with the strongest phenotypic effects on chromosome 10 moderated to some degree by loci on four other chromosomes. For a given locus, sex ratio phenotype varied in magnitude for individuals derived from different dam lines. These data, together with the environmental factors known to contribute to sex determination, characterize the underlying complexity and potential lability of sex determination, and confirm the polygenic architecture of sex determination in T. californicus.

  20. Software implementation of floating-point arithmetic on a reduced-instruction-set processor

    Energy Technology Data Exchange (ETDEWEB)

    Gross, T.

    1985-11-01

    Current single chip implementations of reduced-instruction-set processors do not support hardware floating-point operations. Instead, floating-point operations have to be provided either by a coprocessor or by software. This paper discusses issues arising from a software implementation of floating-point arithmetic for the MIPS processor, an experimental VLSI architecture. Measurements indicate that an acceptable level of performance is achieved, but this approach is no substitute for a hardware accelerator if higher-precision results are required. This paper includes instruction profiles for the basic floating-point operations and evaluates the usefulness of some aspects of the instruction set.

  1. Building RNC in All-IP Wireless Networks using Network Processors

    Institute of Scientific and Technical Information of China (English)

    CHENG Sheng; NI Xian-le; ZHU Xin-ning; DING Wei

    2004-01-01

    This paper describes a solution for building a network-processor-based Radio Network Controller (RNC) in all-IP wireless networks. It covers the structure of 3rd Generation (3G) wireless networks and the role of network nodes such as the Base Station (BS), RNC, and Packet-Switched Core Network (PSCN), as well as the architecture of the IXP2800 network processor and the detailed implementation of the solution on an IXP2800-based RNC. The solution can provide scalable IP forwarding features and is expected to be widely used in 3G RNCs.

  2. Next generation Associative Memory devices for the FTK tracking processor of the ATLAS experiment

    CERN Document Server

    Andreani, A; The ATLAS collaboration; Beccherle, B; Beretta, M; Citterio, M; Crescioli, F; Colombo, A; Giannetti, P; Liberali, V; Shojaii, J; Stabile, A

    2013-01-01

    The AMchip is a VLSI device that implements the associative memory function, a special content addressable memory specifically designed for high energy physics applications and first used in the CDF experiment at Tevatron. The 4th generation of AMchip has been developed for the core pattern recognition stage of the Fast TracKer (FTK) processor: a hardware processor for online reconstruction of particle trajectories at the ATLAS experiment at LHC. We present the architecture, design considerations, power consumption and performance measurements of the 4th generation of AMchip. We present also the design innovations toward the 5th generation and the first prototype results.

  3. Geometric Design Rule Check of VLSI Layouts in Mesh Connected Processors

    Directory of Open Access Journals (Sweden)

    S. K. Nandy

    1994-01-01

    Design Rule Checking is a compute-intensive VLSI CAD tool. In this paper we propose a parallel algorithm to perform Design Rule Check (DRC) of layout geometries in a VLSI layout. The algorithm assumes the parallel architecture to be a two-dimensional mesh of processors. The algorithm is based on a linear quadtree representation of the layout. Through a complexity analysis it is shown that it is possible to achieve a linear speedup in DRC with respect to the number of processors.
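
    A linear quadtree stores only leaf cells, each identified by a locational key, instead of an explicit tree of pointers. One widely used key is the Morton (Z-order) code obtained by interleaving the bits of the cell coordinates, sketched below. The abstract does not state which encoding the authors use, so this choice, the function name and the sample cells are assumptions made for illustration.

```python
def morton_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return key


# Hypothetical layout cells given as (x, y) grid coordinates.
cells = [(3, 5), (0, 0), (1, 0), (0, 1)]
ordered = sorted(cells, key=lambda cell: morton_key(*cell))
print(ordered)
```

    Sorting by key keeps spatially nearby cells close together in the list, so contiguous slices of the sorted list can be handed to different processors of a mesh.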

  4. Computer architecture fundamentals and principles of computer design

    CERN Document Server

    Dumas II, Joseph D

    2005-01-01

    Introduction to Computer Architecture; What is Computer Architecture?; Architecture vs. Implementation; Brief History of Computer Systems; The First Generation; The Second Generation; The Third Generation; The Fourth Generation; Modern Computers - The Fifth Generation; Types of Computer Systems; Single Processor Systems; Parallel Processing Systems; Special Architectures; Quality of Computer Systems; Generality and Applicability; Ease of Use; Expandability; Compatibility; Reliability; Success and Failure of Computer Architectures and Implementations; Quality and the Perception of Quality; Cost Issues; Architectural Openness, Market Timi

  5. DSP algorithms in FPGA: proposition of a new architecture

    Science.gov (United States)

    Kolasinski, Piotr; Zabolotny, Wojciech

    2008-01-01

    This paper presents a new reconfigurable architecture created in FPGA which is optimized for DSP algorithms like digital filters or digital transforms. The architecture tries to combine advantages of typical architectures like DSP processors and datapath architecture, while avoiding their drawbacks. The architecture is built from blocks called Operational Units (OU). Each Operational Unit contains the Control Unit (CU), which controls its operation. The Operational Units may operate in parallel, which shortens the processing time. This structure is also highly flexible, because all OUs may operate independently, executing their own programs. User may customize connections between units and modify architecture by adding new modules.

  6. The Potential of the Cell Processor for Scientific Computing

    Energy Technology Data Exchange (ETDEWEB)

    Williams, Samuel; Shalf, John; Oliker, Leonid; Husbands, Parry; Kamil, Shoaib; Yelick, Katherine

    2005-10-14

    The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. We are the first to present quantitative Cell performance data on scientific kernels and show direct comparisons against leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1) architectures. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop both analytical models and simulators to predict kernel performance. Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture. Additionally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

  7. Application developer's tutorial for the CSM testbed architecture

    Science.gov (United States)

    Underwood, Phillip; Felippa, Carlos A.

    1988-01-01

    This tutorial serves as an illustration of the use of the programmer interface on the CSM Testbed Architecture (NICE). It presents a complete, but simple, introduction to using both the GAL-DBM (Global Access Library-Database Manager) and CLIP (Command Language Interface Program) to write a NICE processor. Familiarity with the CSM Testbed architecture is required.

  8. Adapting implicit methods to parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D. [Univ. of Missouri, Rolla, MO (United States)

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g., larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems, where it is common to take the computational domain and distribute the grid points over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this results in idle processors during part of the computation, and the more processors sit idle, the smaller the effective speedup gained from using a parallel processor.
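
    The remark that implicit methods need simultaneous knowledge of the whole domain can be made concrete with a small sketch: one backward-Euler step of the 1-D heat equation, solved with the Thomas algorithm for tridiagonal systems. The problem, the zero boundary values and the parameter `r` (diffusion coefficient times time step over grid spacing squared) are assumptions chosen for illustration and are not taken from the paper.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (a: sub-, b: main, c: super-diagonal, d: rhs)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                 # forward sweep: row i needs row i-1
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution: row i needs row i+1
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def implicit_heat_step(u, r):
    """One backward-Euler step; values outside the grid are held at zero."""
    n = len(u)
    a = [0.0] + [-r] * (n - 1)
    b = [1.0 + 2.0 * r] * n
    c = [-r] * (n - 1) + [0.0]
    return thomas(a, b, c, u)


u_next = implicit_heat_step([0.0, 0.0, 1.0, 0.0, 0.0], 0.5)
print(u_next)
```

    Both sweeps walk the grid in order, so each point's new value depends on every other point. When the grid is split across processors, that chain of dependencies is what forces message exchange and idle time unless the method is adapted.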

  9. Multi-output programmable quantum processor

    OpenAIRE

    Yu, Yafei; Feng, Jian; Zhan, Mingsheng

    2002-01-01

    By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79:321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately transformed versions of the input data. The processor succeeds with a certain probability.

  10. 7 CFR 1215.14 - Processor.

    Science.gov (United States)

    2010-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE POPCORN PROMOTION, RESEARCH, AND CONSUMER INFORMATION Popcorn Promotion, Research, and Consumer Information Order Definitions § 1215.14 Processor. Processor means a person engaged in the preparation of unpopped popcorn for the market who...

  11. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably se

  12. An Empirical Evaluation of XQuery Processors

    NARCIS (Netherlands)

    Manegold, S.

    2008-01-01

    This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks --- the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007 --- on six XQuery processors, three stand-alone (file-based) XQuery

  13. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably

  14. Towards a Process Algebra for Shared Processors

    DEFF Research Database (Denmark)

    Buchholtz, Mikael; Andersen, Jacob; Løvengreen, Hans Henrik

    2002-01-01

    We present initial work on a timed process algebra that models sharing of processor resources allowing preemption at arbitrary points in time. This enables us to model both the functional and the timely behaviour of concurrent processes executed on a single processor. We give a refinement relation...

  15. Novel WLL Architecture Based on Color Pixel Multiple Access Implemented on a Terrestrial Video Network as the Overlay

    DEFF Research Database (Denmark)

    Sanyal, Rajarshi; Cianca, Ernestina; Prasad, Ramjee

    2013-01-01

    for conveying data symbols from one end to the other. The big question is: Is it feasible to implement color synthesized by the video systems for the purpose of telecommunications? In this paper we propose the ‘Color Pixel Multiple Access’ scheme for the radio access network and Color Pixel Multiplexing for the core network, by implementing electronic color as a tool for addressing and bearing data overhead. The present day video systems that can generate millions of colors in their electronic form have been utilized to set up a wireless network, serving mobile stations or computers as its nodes. The state of the art...

  16. The Chameleon Architecture for Streaming DSP Applications

    Directory of Open Access Journals (Sweden)

    André B. J. Kokkeler

    2007-02-01

    We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm² in a 130 nm process), is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast; for example, loading the coefficients for a 200-tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC) via a network interface (NI). Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT) and best effort (BE). For both NoCs, estimates of power consumption are presented. The NI synchronizes data transfers, and configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.

  17. The Chameleon Architecture for Streaming DSP Applications

    Directory of Open Access Journals (Sweden)

    Heysters PaulM

    2007-01-01

    We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm² in a 130 nm process), is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast; for example, loading the coefficients for a 200-tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC) via a network interface (NI). Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT) and best effort (BE). For both NoCs, estimates of power consumption are presented. The NI synchronizes data transfers, and configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.

  18. AltiVec performance increases for autonomous robotics for the MARSSCAPE architecture program

    Science.gov (United States)

    Gothard, Benny M.

    2002-02-01

    One of the main tall poles that must be overcome to develop a fully autonomous vehicle is the inability of the computer to understand its surrounding environment to the level required for the intended task. The military mission scenario requires a robot to interact in a complex, unstructured, dynamic environment (reference: A High Fidelity Multi-Sensor Scene Understanding System for Autonomous Navigation). The Mobile Autonomous Robot Software Self Composing Adaptive Programming Environment (MarsScape) perception research addresses three aspects of the problem: sensor system design, processing architectures, and algorithm enhancements. A prototype perception system has been demonstrated on robotic High Mobility Multi-purpose Wheeled Vehicle and All Terrain Vehicle testbeds. This paper addresses the tall pole of processing requirements and the performance improvements based on the selected MarsScape Processing Architecture. The processor chosen is the Motorola AltiVec-G4 PowerPC (PPC) (1998 Motorola, Inc.), a highly parallelized commercial Single Instruction Multiple Data processor. Both derived perception benchmarks and actual perception subsystem code will be benchmarked and compared against previous Demo II Semi-autonomous Surrogate Vehicle processing architectures along with desktop Personal Computers (PCs). Performance gains are highlighted with progress to date, and lessons learned and future directions are described.

  19. Processor core model for quantum computing.

    Science.gov (United States)

    Yung, Man-Hong; Benjamin, Simon C; Bose, Sougato

    2006-06-09

    We describe an architecture based on a processing "core," where multiple qubits interact perpetually, and a separate "store," where qubits exist in isolation. Computation consists of single qubit operations, swaps between the store and the core, and free evolution of the core. This enables computation using physical systems where the entangling interactions are "always on." Alternatively, for switchable systems, our model constitutes a prescription for optimizing many-qubit gates. We discuss implementations of the quantum Fourier transform, Hamiltonian simulation, and quantum error correction.

  20. Neurovision processor for designing intelligent sensors

    Science.gov (United States)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi-functional vision sensor that performs a variety of information processing operations on time-varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  1. A Microprocessor Architecture for Bibliographic Retrieval System.

    Science.gov (United States)

    Martella, G.; Gobbi, G.

    1981-01-01

    Proposes a microprocessor-based architecture that makes large use of parallelism both in processing and in retrieval operations. The proposed system consists of three functional blocks: the query processor, simple query executers, and the answer composer. Twenty-one references are listed. (FM)

  2. ASAM: Automatic architecture synthesis and application mapping

    DEFF Research Database (Denmark)

    Jozwiak, Lech; Lindwer, Menno; Corvino, Rosilde

    2013-01-01

    This paper focuses on mastering the automatic architecture synthesis and application mapping for heterogeneous massively-parallel MPSoCs based on customizable application-specific instruction-set processors (ASIPs). It presents an overview of the research being currently performed in the scope of...

  3. Automatic Hardware Generation for Reconfigurable Architectures

    NARCIS (Netherlands)

    Nane, R.

    2014-01-01

    Reconfigurable Architectures (RA) have been gaining popularity rapidly in the last decade for two reasons. First, processor clock frequencies reached threshold values past which power dissipation becomes a very difficult problem to solve. As a consequence, alternatives were sought to keep improving

  4. NMRFx Processor: a cross-platform NMR data processing program.

    Science.gov (United States)

    Norris, Michael; Fetler, Bayard; Marchant, Jan; Johnson, Bruce A

    2016-08-01

    NMRFx Processor is a new program for the processing of NMR data. Written in the Java programming language, NMRFx Processor is a cross-platform application and runs on Linux, Mac OS X and Windows operating systems. The application can be run in both a graphical user interface (GUI) mode and from the command line. Processing scripts are written in the Python programming language and executed so that the low-level Java commands are automatically run in parallel on computers with multiple cores or CPUs. Processing scripts can be generated automatically from the parameters of NMR experiments or interactively constructed in the GUI. A wide variety of processing operations are provided, including methods for processing of non-uniformly sampled datasets using iterative soft thresholding. The interactive GUI also enables the use of the program as an educational tool for teaching basic and advanced techniques in NMR data analysis.

  5. Architectural slicing

    DEFF Research Database (Denmark)

    Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2013-01-01

    Architectural prototyping is a widely used practice, concerned with taking architectural decisions through experiments with lightweight implementations. However, many architectural decisions are only taken when systems are already (partially) implemented. This is problematic in the context of architectural prototyping since experiments with full systems are complex and expensive and thus architectural learning is hindered. In this paper, we propose a novel technique for harvesting architectural prototypes from existing systems, "architectural slicing", based on dynamic program slicing. Given a system and a slicing criterion, architectural slicing produces an architectural prototype that contains the elements in the architecture that are dependent on the elements in the slicing criterion. Furthermore, we present an initial design and implementation of an architectural slicer for Java.

  6. Architectural Slicing

    DEFF Research Database (Denmark)

    Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2013-01-01

    Architectural prototyping is a widely used practice, concerned with taking architectural decisions through experiments with lightweight implementations. However, many architectural decisions are only taken when systems are already (partially) implemented. This is problematic in the context of architectural prototyping since experiments with full systems are complex and expensive and thus architectural learning is hindered. In this paper, we propose a novel technique for harvesting architectural prototypes from existing systems, "architectural slicing", based on dynamic program slicing. Given a system and a slicing criterion, architectural slicing produces an architectural prototype that contains the elements in the architecture that are dependent on the elements in the slicing criterion. Furthermore, we present an initial design and implementation of an architectural slicer for Java.

  7. Migration of vectorized iterative solvers to distributed memory architectures

    Energy Technology Data Exchange (ETDEWEB)

    Pommerell, C. [AT&T Bell Labs., Murray Hill, NJ (United States)]; Ruehl, R. [CSCS-ETH, Manno (Switzerland)]

    1994-12-31

    Both necessity and opportunity motivate the use of high-performance computers for iterative linear solvers. Necessity results from the size of the problems being solved; smaller problems are often better handled by direct methods. Opportunity arises from the formulation of the iterative methods in terms of simple linear algebra operations, even if this "natural" parallelism is not easy to exploit in irregularly structured sparse matrices and with good preconditioners. As a result, high-performance implementations of iterative solvers have attracted a lot of interest in recent years. Most efforts are geared to vectorize or parallelize the dominating operation (structured or unstructured sparse matrix-vector multiplication), or to increase locality and parallelism by reformulating the algorithm (reducing global synchronization in inner products or local data exchange in preconditioners). Target architectures for iterative solvers currently include mostly vector supercomputers and architectures with one or few optimized (e.g., super-scalar and/or super-pipelined RISC) processors and hierarchical memory systems. More recently, parallel computers with physically distributed memory and a better price/performance ratio have been offered by vendors as a very interesting alternative to vector supercomputers. However, programming comfort on such distributed memory parallel processors (DMPPs) still lags behind. Here the authors are concerned with iterative solvers and their changing computing environment. In particular, they are considering migration from traditional vector supercomputers to DMPPs. Application requirements force one to use flexible and portable libraries. They want to extend the portability of iterative solvers rather than reimplementing everything for each new machine, or even for each new architecture.

  8. The Square Kilometre Array Science Data Processor. Preliminary compute platform design

    Science.gov (United States)

    Broekema, P. C.; van Nieuwpoort, R. V.; Bal, H. E.

    2015-07-01

    The Square Kilometre Array is a next-generation radio telescope, to be built in South Africa and Western Australia. It is currently in its detailed design phase, with procurement and construction scheduled to start in 2017. The SKA Science Data Processor is the high-performance computing element of the instrument, responsible for producing science-ready data. This is a major IT project, with the Science Data Processor expected to challenge the computing state of the art even in 2020. In this paper we introduce the preliminary Science Data Processor design and the principles that guide the design process, as well as the constraints on the design. We introduce a highly scalable and flexible system architecture capable of handling the SDP workload.

  9. XOP: a second generation fast processor for on-line use in high energy physics experiments

    CERN Document Server

    Lingjaerde, Tor

    1981-01-01

    Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fast execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction-set by microcoded instructions, the various fields of which control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated optimized modules under control of a central instruction unit. Thus, the arithmetic operations, address calculations, conditional checking, loop counts and next instruction evaluation all overlap in time. Based upon the experience from ESOP the architecture of a new processor "XOP" is beginning to take shape which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data fi...

  10. Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

    Energy Technology Data Exchange (ETDEWEB)

    Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

    2005-09-20

    At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over µClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA, as opposed to a hard core processor, which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

  11. Implementing kinematics computation in FPGA co-processor for a 6-DOF space manipulator

    Institute of Scientific and Technical Information of China (English)

    Zheng Yili; Sun Hanxu; Jia Qingxuan; Shi Guozhen

    2009-01-01

    Based on the coordinate rotation digital computer (CORDIC) algorithm, the high-speed kinematics calculation for a six degree of freedom (DOF) space manipulator is implemented in a field programmable gate array (FPGA) co-processor. A pipeline architecture is adopted to reduce the complexity and time consumption of the kinematics calculation. The CORDIC soft-core and the CORDIC-based pipelined kinematics calculation co-processor are described in the very-high-speed integrated circuit hardware description language (VHDL) and realized in the FPGA. Finally, the feasibility of the design is validated in the Spartan-3 FPGA of Xilinx Inc., and the performance specifications of the FPGA co-processor are discussed. The results show that the time consumption of the kinematics calculation is greatly reduced.
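
    As an illustration of the CORDIC idea named in this record (not the authors' VHDL design), the following sketch computes sine and cosine in rotation mode. It uses floating point for readability, whereas a hardware version would use fixed-point values so that each multiplication by 2^-i becomes a shift; the function name `cordic_sin_cos` and the iteration count are assumptions made for the example.

```python
import math


def cordic_sin_cos(angle, iterations=32):
    """Return (sin(angle), cos(angle)) using CORDIC in rotation mode.

    The input is assumed to lie in [-pi/2, pi/2], which is inside the
    convergence range of the elementary rotations used here.
    """
    # Elementary rotation angles atan(2^-i); a hardware design would
    # store these in a small lookup table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # starting from the inverse of the total gain compensates for that.
    gain = 1.0
    for i in range(iterations):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0  # rotate towards z == 0
        shift = 2.0 ** -i              # a right shift in fixed point
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * angles[i]
    return y, x
```

    Every iteration needs only additions, a shift and a table lookup, and the iterations can be unrolled into pipeline stages, which is the property a pipelined FPGA design can exploit.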

  12. Instruction-level Real-time Secure Processor Using an Error Correction Code

    Directory of Open Access Journals (Sweden)

    YOON, S. M.

    2015-08-01

    In this paper, we present a processor that detects security-attacks at the instruction level by checking the integrity of instructions in real time. To confirm the integrity of the instructions, we generate a parity chain of instructions and check them at run time. The parity chain is generated using an error correction code used in a digital communication system, and the integrity checker has the same function as the error-detector module of the error correction code. This architecture can readily be applied to a general processor, because the checker is located between the processor core and the instruction memory. Compared with other cipher modules with the same key space, our instruction integrity checker achieves a faster check speed and occupies a smaller area.

  13. Optimizing performance of superscalar codes for a single Cray X1MSP processor

    Energy Technology Data Exchange (ETDEWEB)

    Shan, Hongzhang; Strohmaier, Erich; Oliker, Leonid

    2004-06-08

    The growing gap between sustained and peak performance for full-scale complex scientific applications on conventional supercomputers is a major concern in high performance computing. The recently-released vector-based Cray X1 offers to bridge this gap for many demanding scientific applications. However, this unique architecture contains both data caches and multi-streaming processing units, and the optimal programming methodology is still under investigation. In this paper we investigate Cray X1 code optimization for a suite of computational kernels originally designed for superscalar processors. For our study, we select four applications from the SPLASH2 application suite (1-D FFT, Radix, Ocean, and Nbody), two kernels from the NAS benchmark suite (3-D FFT and CG), and a matrix-matrix multiplication kernel. Results show that in many cases the addition of vectorization compiler directives results in faster runtimes. However, to achieve a significant performance improvement via increased vector length, it is often necessary to restructure the program at the source level, sometimes leading to algorithmic level transformations. Additionally, memory bank conflicts may result in substantial performance losses. These conflicts can often be exacerbated when optimizing code for increased vector lengths, and must be explicitly minimized. Finally, we investigate the effect of the X1 data caches on overall performance.

  14. DESIGN OF A RECONFIGURABLE DSP PROCESSOR WITH BIT EFFICIENT RESIDUE NUMBER SYSTEM

    Directory of Open Access Journals (Sweden)

    Chaitali Biswas Dutta

    2012-10-01

    The Residue Number System (RNS), which originates from the Chinese Remainder Theorem, offers a promising future in VLSI because of its carry-free operations in addition, subtraction and multiplication. This property of RNS is very helpful in reducing the complexity of calculation in many applications. A residue number system represents a large integer using a set of smaller integers, called residues. However, the area overhead, cost and speed depend not only on the word length but also on the selection of moduli, which is a very crucial step for a residue system. This parameter determines bit efficiency, area, frequency, etc. In this paper a new moduli set selection technique is proposed to improve bit efficiency, which can be used to construct a residue system for a digital signal processing environment. Subsequently, it is theoretically proved, and illustrated using examples, that the proposed solution gives better results than the schemes reported in the literature. The novelty of the architecture is shown by comparison with the different schemes reported in the literature. Using the novel moduli set, a guideline for a Reconfigurable Processor is presented here that can process some predefined functions. As RNS minimizes carry propagation, the scheme can be implemented in real-time signal processing and other fields where high speed computations are required.
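
    The carry-free property described in this record can be seen in a few lines. The sketch below is generic and does not use the moduli set proposed in the paper: the set (7, 8, 9) and all function names are assumptions chosen for illustration. Each residue channel is computed independently, and the Chinese Remainder Theorem recovers the integer as long as the true result stays below the product of the moduli (504 here).

```python
from math import prod

# Hypothetical, pairwise coprime moduli; dynamic range is 7 * 8 * 9 = 504.
MODULI = (7, 8, 9)


def to_rns(n, moduli=MODULI):
    """Represent a non-negative integer by its residues."""
    return tuple(n % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    """Add channel by channel; no carry crosses between channels."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    """Multiply channel by channel, again without inter-channel carries."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Recover the integer with the Chinese Remainder Theorem."""
    total_range = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = total_range // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % total_range
```

    For example, `from_rns(rns_mul(to_rns(17), to_rns(23)))` recovers 391, since 17 × 23 = 391 fits within the dynamic range. The choice of moduli fixes both that range and the width of each channel, which is why moduli selection drives bit efficiency.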

  15. LOW COMPLEXITY CONSTRAINTS FOR ENERGY AND PERFORMANCE MANAGEMENT OF HETEROGENEOUS MULTICORE PROCESSORS USING DYNAMIC OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    A. S. Radhamani

    2014-01-01

    Optimization in multicore processor environments is significant in real-world dynamic applications, where it is crucial to find and track changes effectively over time, which requires an optimization algorithm. In massively parallel multicore processor architectures, Constraint based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling can be implemented effectively, like other population-based metaheuristics. In this study, possible approaches to parallelizing CBFPSO in a multicore system, which use different constraints to exploit parallelism, are explored and evaluated. Because they keep a good balance between convergence and maintenance in real-world applications, CBFPSOs have attracted increasing attention in recent years among the various algorithms for parallel architecture optimization. To tackle the challenges of parallel architecture optimization, several strategies have been proposed to enhance the performance of Particle Swarm Optimization (PSO), and they have obtained success on various multicore parallel architecture optimization problems. However, some issues in multicore architectures still need to be analyzed carefully. In this study, a new CBFPSO scheduling scheme for multicore architectures is proposed, which updates the velocity and position by two bacterial behaviours, i.e., reproduction and elimination dispersal. The performance of CBFPSO is compared with simulation results of GA, and the results show that the proposed algorithm performs well on almost all types of cores compared to GA with respect to completion time and energy consumption.

  16. A Distributed DB Architecture for Processing cPIR Queries

    Directory of Open Access Journals (Sweden)

    Sultan.M

    2013-06-01

    Information retrieval is the process of obtaining materials, usually documents, from a huge volume of unstructured data. Several protocols are available to retrieve bit information from distributed databases. A cloud framework provides a platform for private information retrieval. In this article, we combine the artifacts of the distributed system with a cloud framework for extracting information from unstructured databases. The process involves distributing the database to a number of co-operative peers, which reduces the response time of the query by drawing on the computational resources of the peers. A single query is subdivided into multiple queries and processed in parallel across the distributed sites. Our simulation results using CloudSim show that this distributed database architecture reduces the cost of computational Private Information Retrieval, with reduced response time and processor overload in peer sites.

  17. Architectural Prototyping

    DEFF Research Database (Denmark)

    Bardram, Jakob; Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2004-01-01

    A major part of software architecture design is learning how specific architectural designs balance the concerns of stakeholders. We explore the notion of "architectural prototypes", correspondingly architectural prototyping, as a means of using executable prototypes to investigate stakeholders' concerns with respect to a system under development. An architectural prototype is primarily a learning and communication vehicle used to explore and experiment with alternative architectural styles, features, and patterns in order to balance different architectural qualities. The use of architectural prototypes in the development process is discussed, and we argue that such prototypes can play a role throughout the entire process. The use of architectural prototypes is illustrated by three distinct cases of creating software systems. We argue that architectural prototyping can provide key insights...

  18. Architectural prototyping

    DEFF Research Database (Denmark)

    Bardram, Jakob Eyvind; Christensen, Henrik Bærbak; Hansen, Klaus Marius

    2004-01-01

    A major part of software architecture design is learning how specific architectural designs balance the concerns of stakeholders. We explore the notion of "architectural prototypes", correspondingly architectural prototyping, as a means of using executable prototypes to investigate stakeholders' concerns with respect to a system under development. An architectural prototype is primarily a learning and communication vehicle used to explore and experiment with alternative architectural styles, features, and patterns in order to balance different architectural qualities. The use of architectural prototypes in the development process is discussed, and we argue that such prototypes can play a role throughout the entire process. The use of architectural prototypes is illustrated by three distinct cases of creating software systems. We argue that architectural prototyping can provide key insights...

  19. Processor arrays with asynchronous TDM optical buses

    Science.gov (United States)

    Li, Y.; Zheng, S. Q.

    1997-04-01

    We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of two hardwired priority schemes: the linear priority scheme or the round-robin priority scheme. Our simulation results show that the performance of our proposed buses is significantly better than that of known pipelined synchronous time division multiplexing optical buses. We also propose a class of processor arrays connected by pipelined asynchronous time division multiplexing optical buses. We claim that our proposed processor arrays not only perform better but also scale better than existing processor arrays connected by pipelined synchronous time division multiplexing optical buses.
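
    The two priority schemes named in this record can be contrasted with a generic arbitration sketch. It models only the choice of which requesting processor is granted the bus next, not the optical hardware or its timing, and the function names are hypothetical.

```python
def linear_priority(requests):
    """Grant the lowest-indexed requester (fixed, position-based priority).

    `requests` is a list of booleans, one per processor on the bus.
    Returns the granted index, or None if nobody is requesting.
    """
    for index, wants_bus in enumerate(requests):
        if wants_bus:
            return index
    return None


def round_robin_priority(requests, last_granted):
    """Grant the first requester found after the previously granted one.

    Searching starts at last_granted + 1 and wraps around, so a processor
    that was just served has the lowest priority in the next round.
    """
    count = len(requests)
    for offset in range(1, count + 1):
        index = (last_granted + offset) % count
        if requests[index]:
            return index
    return None
```

    With requests `[True, False, True]`, linear priority always grants processor 0, while round-robin alternates between 0 and 2. That difference is the usual fairness argument for round-robin under sustained load.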

  20. Accelerate Climate Models with the IBM Cell Processor

    Science.gov (United States)

    Zhou, S.; Duffy, D.; Clune, T.; Williams, S.; Suarez, M.; Halem, M.

    2008-12-01

    Ever increasing model resolutions and physical processes in climate models demand continual computing power increases. The IBM Cell processor's order-of-magnitude peak performance increase over conventional processors makes it very attractive for fulfilling this requirement. However, the Cell's characteristics (256 KB of local memory per SPE and the new low-level communication mechanism) make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics components (~50% total computation time), (2) has a high computational load relative to data traffic to/from main memory, and (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision Fortran code) to C and ported it to an IBM BladeCenter QS20, manually SIMDizing 4 independent columns, and found that a Cell with 8 SPEs can process more than 3000 columns per second. Compared with the baseline results, the Cell is ~6.76x, ~8.91x, and ~9.85x faster than a core on Intel's Xeon Woodcrest, Dempsey, and Itanium2, respectively. Our analysis shows that the Cell could also speed up the dynamics component (~25% total computation time). We believe this dramatic performance improvement makes the Cell processor very competitive, at least as an accelerator. We will report our experience in porting both the C and Fortran codes and will discuss our work in porting other climate model components.

  1. Demonstration of two-qubit algorithms with a superconducting quantum processor.

    Science.gov (United States)

    DiCarlo, L; Chow, J M; Gambetta, J M; Bishop, Lev S; Johnson, B R; Schuster, D I; Majer, J; Blais, A; Frunzio, L; Girvin, S M; Schoelkopf, R J

    2009-07-09

    Quantum computers, which harness the superposition and entanglement of physical states, could outperform their classical counterparts in solving problems with technological impact, such as factoring large numbers and searching databases. A quantum processor executes algorithms by applying a programmable sequence of gates to an initialized register of qubits, which coherently evolves into a final state containing the result of the computation. Building a quantum processor is challenging because of the need to meet simultaneously requirements that are in conflict: state preparation, long coherence times, universal gate operations and qubit readout. Processors based on a few qubits have been demonstrated using nuclear magnetic resonance, cold ion trap and optical systems, but a solid-state realization has remained an outstanding challenge. Here we demonstrate a two-qubit superconducting processor and the implementation of the Grover search and Deutsch-Jozsa quantum algorithms. We use a two-qubit interaction, tunable in strength by two orders of magnitude on nanosecond timescales, which is mediated by a cavity bus in a circuit quantum electrodynamics architecture. This interaction allows the generation of highly entangled states with concurrence up to 94 per cent. Although this processor constitutes an important step in quantum computing with integrated circuits, continuing efforts to increase qubit coherence times, gate performance and register size will be required to fulfil the promise of a scalable technology.
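
    As an illustration of the Grover search mentioned above, the following is a minimal sketch of the textbook two-qubit case, simulated classically on four real amplitudes. It is an idealized, noise-free model and not the authors' gate sequence; the function name `grover_two_qubit` and the `marked` parameter are hypothetical names chosen for this example.

```python
def grover_two_qubit(marked: int) -> list[float]:
    """Return measurement probabilities after one Grover iteration on 2 qubits.

    `marked` is the index (0..3) of the basis state the oracle recognizes.
    """
    if marked not in range(4):
        raise ValueError("marked must be 0, 1, 2 or 3")

    # Uniform superposition over the four basis states |00>, |01>, |10>, |11>.
    amps = [0.5] * 4

    # Oracle: flip the sign of the marked state's amplitude.
    amps[marked] = -amps[marked]

    # Diffusion step: reflect every amplitude about the mean amplitude.
    mean = sum(amps) / 4
    amps = [2 * mean - a for a in amps]

    # Born rule: probabilities are squared amplitudes.
    return [a * a for a in amps]
```

    For a four-element search space a single iteration is enough: in this ideal model all of the probability ends up on the marked state.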

  2. Identifying Processor Bottlenecks in Virtual Machine Based Execution of Java Bytecode

    Science.gov (United States)

    Rao, Pradeep; Murakami, Kazuaki

    Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of the various processor design decisions on Java performance. We attribute the lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities associated with executing Java bytecode using a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the impact of the various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space consisting of 35 factors (32-billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and the ITLB design parameters. Furthermore, these tables enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.
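
    The size of the design space quoted above can be checked with a short calculation. The abstract does not say how many levels each factor takes; the sketch below assumes two levels per factor, which gives 2^35 = 32 x 2^30 configurations and would match the "32-billion" figure if a billion is read as 2^30. The function names are hypothetical.

```python
def design_space_size(factors: int, levels: int = 2) -> int:
    """Number of configurations in a full factorial design.

    `levels` = 2 is an assumption; the abstract does not state it.
    """
    return levels ** factors


def sampled_fraction(observations: int, factors: int, levels: int = 2) -> float:
    """Fraction of the full design space covered by a screening experiment."""
    return observations / design_space_size(factors, levels)
```

    Under that assumption, `design_space_size(35)` is 34,359,738,368, so 72 observations per benchmark cover only a few billionths of the full space, which is why a screening design is needed rather than exhaustive simulation.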

  3. Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.

    Energy Technology Data Exchange (ETDEWEB)

    Learn, Mark Walter

    2011-04-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

  4. Design of an ultra-low-power digital processor for passive UHF RFID tags

    Institute of Scientific and Technical Information of China (English)

    Shi Wanggen; Zhuang Yiqi; Li Xiaoming; Wang Xianghua; Jin Zhao; Wang Dan

    2009-01-01

    A new architecture of digital processors for passive UHF radio-frequency identification tags is proposed. This architecture is based on ISO/IEC 18000-6C and targeted at ultra-low power consumption. By applying methods like system-level power management, global clock gating and low voltage implementation, the total power of the design is reduced to a few microwatts. In addition, an innovative way for the design of a true RNG is presented, which contributes to both low power and secure data transaction. The digital processor is verified by an integrated FPGA platform and implemented by the Synopsys design kit for ASIC flows. The design fits different CMOS technologies and has been taped out using the 2P4M 0.35μm process of Chartered Semiconductor.

  5. Detector defect correction of medical images on graphics processors

    Science.gov (United States)

    Membarth, Richard; Hannig, Frank; Teich, Jürgen; Litz, Gerhard; Hornegger, Heinz

    2011-03-01

    The ever increasing complexity and power dissipation of computer architectures in the last decade blazed the trail for more power efficient parallel architectures. Hence, such architectures like field-programmable gate arrays (FPGAs) and particular graphics cards attained great interest and are consequently adopted for parallel execution of many number crunching loop programs from fields like image processing or linear algebra. However, there is little effort to deploy barely computational, but memory intensive applications to graphics hardware. This paper considers a memory intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors like DSPs were used for such critical processing. We show that this is today feasible with commodity graphics hardware. Using CUDA as programming model, it is demonstrated that the detector defect correction pipeline consisting of more than ten algorithms is significantly accelerated and that a speedup of 20x can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation. For deployment in a streaming application with steadily new incoming data, it is shown that the memory transfer overhead of successive images to the graphics card memory is reduced by 83% using double buffering.

  6. The new efficient multi-beamforming method based on multiple-access register block on a post-fractional filtering architecture

    Science.gov (United States)

    Kang, Jeeun; Kim, Giduck; Yoon, Changhan; Yoo, Yangmo; Song, Tai-Kyong

    2011-03-01

    In medical ultrasound imaging, a multi-beamforming (MBF) method is used for supporting high frame rate imaging or functional imaging where multiple scanlines are reconstructed from a single excitation event. For efficient MBF, a time-sharing technique (i.e., MBF-TS) can be applied. However, the MBF-TS could degrade image quality due to the decreased beamforming frequency. In this paper, the multi-access register-based MBF (MBF-MAR) method running on the post-fractional filtering (PFF) architecture is presented. In PFF-MBF-MAR, instead of lowering beamforming frequency, a multi-access register at each channel is utilized for generating multiple scanlines simultaneously. To evaluate the performance of the proposed PFF-MBF-MAR method, a phantom experiment was conducted where 64-channel pre-beamformed radio-frequency (RF) data were captured from a tissue mimicking phantom by using a modified commercial ultrasound system (SONOLINE G40, Siemens Inc., USA) with a 3-MHz phased array probe. In the phantom experiment, the PFF-MBF-MAR method showed 4.7 dB and 0.6 improvements in the signal-to-noise ratio (SNR) and the contrast-to-noise ratio (CNR), respectively, compared to the PFF-MBF-TS method, while slightly increasing the hardware complexity (<5.2%). Similar results were achieved with the in vivo thyroid data. These results indicate that the proposed PFF-MBF-MAR method can be used for high frame rate imaging or functional imaging without sacrificing image quality while slightly increasing the hardware complexity.

  7. Visualizing the 3D Architecture of Multiple Erythrocytes Infected with Plasmodium at Nanoscale by Focused Ion Beam-Scanning Electron Microscopy

    Science.gov (United States)

    Soares Medeiros, Lia Carolina; De Souza, Wanderley; Jiao, Chengge; Barrabin, Hector; Miranda, Kildare

    2012-01-01

    Different methods for three-dimensional visualization of biological structures have been developed and extensively applied by different research groups. In the field of electron microscopy, a new technique that has emerged is the use of a focused ion beam and scanning electron microscopy for 3D reconstruction at nanoscale resolution. The higher extent of volume that can be reconstructed with this instrument represents one of the main benefits of this technique, which can provide statistically relevant 3D morphometrical data. As the life cycle of Plasmodium species is a process that involves several structurally complex developmental stages that are responsible for a series of modifications in the erythrocyte surface and cytoplasm, a high number of features within the parasites and the host cells has to be sampled for the correct interpretation of their 3D organization. Here, we used FIB-SEM to visualize the 3D architecture of multiple erythrocytes infected with Plasmodium chabaudi and analyzed their morphometrical parameters in a 3D space. We analyzed and quantified alterations on the host cells, such as the variety of shapes and sizes of their membrane profiles and parasite internal structures such as a polymorphic organization of hemoglobin-filled tubules. The results show the complex 3D organization of Plasmodium and infected erythrocyte, and demonstrate the contribution of FIB-SEM for the obtainment of statistical data for an accurate interpretation of complex biological structures. PMID:22432024

  8. Visualizing the 3D architecture of multiple erythrocytes infected with Plasmodium at nanoscale by focused ion beam-scanning electron microscopy.

    Directory of Open Access Journals (Sweden)

    Lia Carolina Soares Medeiros

    Full Text Available Different methods for three-dimensional visualization of biological structures have been developed and extensively applied by different research groups. In the field of electron microscopy, a new technique that has emerged is the use of a focused ion beam and scanning electron microscopy for 3D reconstruction at nanoscale resolution. The higher extent of volume that can be reconstructed with this instrument represents one of the main benefits of this technique, which can provide statistically relevant 3D morphometrical data. As the life cycle of Plasmodium species is a process that involves several structurally complex developmental stages that are responsible for a series of modifications in the erythrocyte surface and cytoplasm, a high number of features within the parasites and the host cells has to be sampled for the correct interpretation of their 3D organization. Here, we used FIB-SEM to visualize the 3D architecture of multiple erythrocytes infected with Plasmodium chabaudi and analyzed their morphometrical parameters in a 3D space. We analyzed and quantified alterations on the host cells, such as the variety of shapes and sizes of their membrane profiles and parasite internal structures such as a polymorphic organization of hemoglobin-filled tubules. The results show the complex 3D organization of Plasmodium and infected erythrocyte, and demonstrate the contribution of FIB-SEM for the obtainment of statistical data for an accurate interpretation of complex biological structures.

  9. Time-Cost Scheduler for Technological and Economic Challenges Related to Customized Cores and General Purpose Processors

    Directory of Open Access Journals (Sweden)

    Munesh Singh Chauhan

    2014-01-01

    Full Text Available With the renewed interest in the customization of embedded processors for application-specific needs, it becomes imperative to understand its viability both economically and technologically, thus avoiding pitfalls. Customization and scalability are two terms which are often used synonymously to denote the addition or removal of functional units or the increase or decrease of ports in memory register banks in processors. The advantages of customization are improved performance, reduced silicon area and power efficiency. With the option of parameterizing the inclusion or exclusion of functional units, the hardware can be made leaner and thus more energy efficient. Removal of redundant units results in shortening of the critical path in circuits. Though the above advantages look significant, customization carries its own pitfalls, which are often intractable. Firstly, it carries an immense overhead if performed in General Purpose Processors (GPPs). Changes in the hardware architecture result in code mismatch and thus necessitate ISA (Instruction Set Architecture) extensions or at times a complete overhaul. Besides, users are often reluctant to adapt to changes in the ISA, as this involves additional training. The final death knell may come from the limited commercial use of customized processors, resulting in economic losses due to under-utilization of production units. Hence a new insight is needed that caters to the utilization of present technological advancements in processor customization while avoiding the adverse economic fallout that comes from blindly forcing customization everywhere. A graded and selective use of customization in consonance with market and user needs is suggested. Therefore, predicting the development course of microprocessors in general and embedded processors in particular will help businesses focus correctly on the performance and efficiency of systems that use these processors.

  10. Photonics and Fiber Optics Processor Lab

    Data.gov (United States)

    Federal Laboratory Consortium — The Photonics and Fiber Optics Processor Lab develops, tests and evaluates high speed fiber optic network components as well as network protocols. In addition, this...

  11. Radiation Tolerant Software Defined Video Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — MaXentric is proposing a radiation tolerant Software Defined Video Processor, codenamed SDVP, for the problem of advanced motion imaging in the space environment....

  12. Application of compiler-assisted multiple instruction rollback recovery to speculative execution

    Science.gov (United States)

    Alewine, N. J.; Fuchs, W. K.; Hwu, W.-M.

    1993-01-01

    Speculative execution is a method to increase instruction level parallelism which can be exploited by both super-scalar and VLIW architectures. The key to a successful general speculation strategy is a repair mechanism to handle mispredicted branches and accurate reporting of exceptions for speculated instructions. Multiple instruction rollback is a technique developed for recovery from transient processor failure. Many of the difficulties encountered during recovery from branch misprediction or from instruction re-execution due to exception in a speculative execution architecture are similar to those encountered during multiple instruction rollback. The applicability of a recently developed compiler-assisted multiple instruction rollback scheme to aid in speculative execution repair is investigated. Extensions to the compiler-assisted scheme to support branch and exception repair are presented along with performance measurements across ten application programs.

  13. Temperature modeling and emulation of an ASIC temperature monitor system for Tightly-Coupled Processor Arrays (TCPAs)

    OpenAIRE

    E. Glocker; S. Boppu; Chen, Q; Schlichtmann, U.; Teich, J.; D. Schmitt-Landsiedel

    2014-01-01

    This contribution provides an approach for emulating the behaviour of an ASIC temperature monitoring system (TMon) during run-time for a tightly-coupled processor array (TCPA) of a heterogeneous invasive multi-tile architecture to be used for FPGA prototyping. It is based on a thermal RC modeling approach. Also different usage scenarios of TCPA are analyzed and compared.

  14. VLSI Processor For Vector Quantization

    Science.gov (United States)

    Tawel, Raoul

    1995-01-01

    Pixel intensities in each kernel compared simultaneously with all code vectors. Prototype high-performance, low-power, very-large-scale integrated (VLSI) circuit designed to perform compression of image data by vector-quantization method. Contains relatively simple analog computational cells operating on direct or buffered outputs of photodetectors grouped into blocks in imaging array, yielding vector-quantization code word for each such block in sequence. Scheme exploits parallel-processing nature of vector-quantization architecture, with consequent increase in speed.
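
    The comparison of a pixel block against all code vectors can be sketched in software as a nearest-neighbour search. This is a sequential illustration only: the chip described above performs the comparisons in parallel analog cells, and the squared Euclidean distance used here is an assumption, since the abstract does not name the distance measure. The function names are hypothetical.

```python
def quantize_block(block, codebook):
    """Return the index (code word) of the code vector closest to `block`.

    Distance is squared Euclidean, an assumption made for this sketch.
    """
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(codebook)), key=lambda i: dist2(block, codebook[i]))


def encode_blocks(blocks, codebook):
    """Encode a sequence of pixel blocks as a list of code words."""
    return [quantize_block(block, codebook) for block in blocks]
```

    Compression comes from transmitting one small index per block instead of the block's pixel intensities; the decoder looks the index up in the same codebook.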

  15. A New Echeloned Poisson Series Processor (EPSP)

    Science.gov (United States)

    Ivanova, Tamara

    2001-07-01

    A specialized Echeloned Poisson Series Processor (EPSP) is proposed. It is a typical software for the implementation of analytical algorithms of Celestial Mechanics. EPSP is designed for manipulating long polynomial-trigonometric series with literal divisors. The coefficients of these echeloned series are the rational or floating-point numbers. The Keplerian processor and analytical generator of special celestial mechanics functions based on the EPSP are also developed.

  16. MAP3D: a media processor approach for high-end 3D graphics

    Science.gov (United States)

    Darsa, Lucia; Stadnicki, Steven; Basoglu, Chris

    1999-12-01

    Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with high performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline--allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to successfully find and exploit any parallelism in a programmer's code, thus decreasing the time to market of a given application. The ability to run an operating system makes it possible to run concurrent applications in the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved--using real-time decoded imagery for texturing 3D objects, for instance. This novel architecture enables an affordable, integrated solution for high performance 3D graphics.

  17. A two-qubit photonic quantum processor and its application to solving systems of linear equations

    OpenAIRE

    Stefanie Barz; Ivan Kassal; Martin Ringbauer; Yannick Ole Lipp; Borivoje Dakić; Alán Aspuru-Guzik; Philip Walther

    2014-01-01

    Large-scale quantum computers will require the ability to apply long sequences of entangling gates to many qubits. In a photonic architecture, where single-qubit gates can be performed easily and precisely, the application of consecutive two-qubit entangling gates has been a significant obstacle. Here, we demonstrate a two-qubit photonic quantum processor that implements two consecutive CNOT gates on the same pair of polarisation-encoded qubits. To demonstrate the flexibility of our system, w...

  18. Periodic activity migration for fast sequential execution in future heterogeneous multicore processors

    OpenAIRE

    Michaud, Pierre

    2008-01-01

    On each new technology generation, miniaturization permits putting twice as many computing cores on the same silicon area, potentially doubling the processor performance. However, if sequential execution is not accelerated at the same time, Amdahl's law will eventually limit the actual performance. Hence it will be beneficial to have asymmetric multicores where some cores are specialized for fast sequential execution. This specialization may be achieved by architectural means, but it may also...
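
    The Amdahl's law argument above can be made concrete with a small calculation. The sketch below uses the standard form of the law, extended with a hypothetical `sequential_boost` parameter to model a core specialized for fast sequential execution; the parameter names are illustrative and do not come from the paper.

```python
def amdahl_speedup(parallel_fraction: float, cores: int,
                   sequential_boost: float = 1.0) -> float:
    """Overall speedup relative to a single baseline core.

    parallel_fraction: share of baseline run time that parallelizes perfectly.
    cores:             number of cores sharing the parallel part.
    sequential_boost:  hypothetical speedup applied to the sequential part only.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1 or sequential_boost <= 0:
        raise ValueError("cores must be >= 1 and sequential_boost > 0")

    sequential_time = (1.0 - parallel_fraction) / sequential_boost
    parallel_time = parallel_fraction / cores
    return 1.0 / (sequential_time + parallel_time)
```

    With 90% of the work parallel, 16 cores give a speedup of 6.4, and no number of cores can reach 10. Doubling the speed of the sequential part alone raises the 16-core figure to about 9.4, which is the motivation for asymmetric multicores.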

  19. SMART AS A CRYPTOGRAPHIC PROCESSOR

    Directory of Open Access Journals (Sweden)

    Saroja Kanchi

    2016-05-01

    Full Text Available SMaRT is a 16-bit 2.5-address RISC-type single-cycle processor, which was recently designed and successfully mapped into an FPGA chip in our ECE department. In this paper, we use SMaRT to run the well-known encryption algorithm, Data Encryption Standard. For information security purposes, encryption is a must in today’s sophisticated and ever-increasing computer communications such as ATM machines and SIM cards. For comparison and evaluation purposes, we also map the same algorithm on the HC12, a same-size but CISC-type off-the-shelf microcontroller. Our results show that compared to HC12, SMaRT code is only 14% longer in terms of the static number of instructions but about 10 times faster in terms of the number of clock cycles, and 7% smaller in terms of code size. Our results also show that 2.5-address instructions, a SMaRT selling point, amount to 45% of the whole R-type instructions resulting in significant improvement in static number of instructions hence code size as well as performance. Additionally, we see that the SMaRT short-branch range is sufficiently wide in 90% of cases in the SMaRT code. Our results also reveal that the SMaRT novel concept of locality of reference in using the MSBs of the registers in non-subroutine branch instructions stays valid with a remarkable hit rate of 95%!

  20. Experiences with Compiler Support for Processors with Exposed Pipelines

    DEFF Research Database (Denmark)

    Jensen, Nicklas Bo; Schleuniger, Pascal; Hindborg, Andreas Erik;

    2015-01-01

    Field programmable gate arrays, FPGAs, have become an attractive implementation technology for a broad range of computing systems. We recently proposed a processor architecture, Tinuso, which achieves high performance by moving complexity from hardware to the compiler tool chain. This means...... that the compiler tool chain must handle the increased complexity. However, it is not clear if current production compilers can successfully meet the strict constraints on instruction order and generate efficient object code. In this paper, we present our experiences developing a compiler backend using the GNU...... Compiler Collection, GCC. For a set of C benchmarks, we show that a Tinuso implementation with our GCC backend reaches a relative speedup of up to 1.73 over a similar Xilinx Micro Blaze configuration while using 30% fewer hardware resources. While our experiences are generally positive, we expose some...

  1. Airborne ocean water lidar (OWL) real time processor (RTP)

    Science.gov (United States)

    Hryszko, M.

    1995-03-01

    The Hyperflo Real Time Processor (RTP) was developed by Pacific-Sierra Research Corporation as a part of the Naval Air Warfare Center's Ocean Water Lidar (OWL) system. The RTP was used for real time support of open ocean field tests at Barbers Point, Hawaii, in March 1993 (EMERALD I field test), and Jacksonville, Florida, in July 1994 (EMERALD I field test). This report describes the system configuration and accomplishments associated with the preparation and execution of these exercises. This document is intended to supplement the overall test reports and provide insight into the development and use of the RTP. A secondary objective is to provide basic information on the capabilities, versatility and expandability of the Hyperflo RTP for possible future projects. It is assumed herein that the reader has knowledge of the OWL system, field test operations, general lidar processing methods, and basic computer architecture.

  2. Reconfigurable FFT Processor – A Broader Perspective Survey

    Directory of Open Access Journals (Sweden)

    V.Sarada

    2013-04-01

    Full Text Available FFT (Fast Fourier Transform) processing is one of the key procedures in popular orthogonal frequency division multiplexing (OFDM) based communication systems such as Digital Audio Broadcasting (DAB), Digital Video Broadcasting Terrestrial (DVB-T), Asymmetric Digital Subscriber Loop (ADSL), etc. These application domains require performing FFTs of various sizes, from 64 to 8192 points. Implementing each FFT on a dedicated IP presents a great overhead in the silicon area of the chip. Supporting the different FFT sizes of new wireless telecommunication standards may increase the time to market. These considerations make the FFT an ideal candidate for reconfigurable implementation. Efficient implementation of the FFT processor with small area, low power and high speed is very important. This survey paper studies efficient algorithms and architectures for reconfigurable FFT design and observes common traits of the good contributions.

  3. How to Safely Integrate Multiple Applications on Embedded Many-Core Systems by Applying the “Correctness by Construction” Principle

    Directory of Open Access Journals (Sweden)

    Robert Hilbrich

    2012-01-01

    Full Text Available Software-intensive embedded systems, especially cyber-physical systems, benefit from the additional performance and the small power envelope offered by many-core processors. Nevertheless, the adoption of a massively parallel processor architecture in the embedded domain is still challenging. The integration of multiple and potentially parallel functions on a chip—instead of just a single function—makes best use of the resources offered. However, this multifunction approach leads to new technical and nontechnical challenges during the integration. This is especially the case for a distributed system architecture, which is subject to specific safety considerations. In this paper, it is argued that these challenges cannot be effectively addressed with traditional engineering approaches. Instead, the application of the “correctness by construction” principle is proposed to improve the integration process.

  4. New bus architecture for distributed avionic systems

    Energy Technology Data Exchange (ETDEWEB)

    Leonard, W.B.; Chow, K.K.

    1983-01-01

    The authors discuss a bus architecture which offers several advantages over conventional buses for a large selection of applications, and which allows for both interrupts and error checking. The bus may be expanded in the form of a tree with many processors, memories, and input/output devices distributed freely throughout the system. Therefore, it is inherently a multiprocessor bus and is well suited for systems in which a central controller controls several subordinate processors and devices. Its low overhead plus large capacity for expansion make it appropriate for a wide range of future systems.

  5. MC64-ClustalWP2: a highly-parallel hybrid strategy to align multiple sequences in many-core architectures.

    Directory of Open Access Journals (Sweden)

    David Díaz

    Full Text Available We have developed the MC64-ClustalWP2 as a new implementation of the Clustal W algorithm, integrating a novel parallelization strategy and significantly increasing the performance when aligning long sequences in architectures with many cores. It must be stressed that in such a process, the detailed analysis of both the software and hardware features and peculiarities is of paramount importance to reveal key points to exploit and optimize the full potential of parallelism in many-core CPU systems. The new parallelization approach has focused on the most time-consuming stages of this algorithm. In particular, the so-called progressive alignment has drastically improved the performance, due to a fine-grained approach where the forward and backward loops were unrolled and parallelized. Another key approach has been the implementation of the new algorithm in a hybrid-computing system, integrating both an Intel Xeon multi-core CPU and a Tilera Tile64 many-core card. A comparison with other Clustal W implementations reveals the high performance of the new algorithm and strategy in many-core CPU architectures, in a scenario where the sequences to align are relatively long (more than 10 kb) and, hence, many-core GPU hardware cannot be used. Thus, the MC64-ClustalWP2 runs multiple alignments more than 18x faster than the original Clustal W algorithm, and more than 7x faster than the best x86 parallel implementation to date, being publicly available through a web service.
Besides, these developments have been deployed in cost-effective personal computers and should be useful for life-science researchers, including the identification of identities and differences for mutation/polymorphism analyses, biodiversity and evolutionary studies and for the development of molecular markers for paternity testing, germplasm management and protection, to assist breeding, illegal traffic control, fraud prevention and for the protection of the intellectual property (identification

  6. Associative Memory design for the FastTrack processor (FTK) at ATLAS

    CERN Document Server

    Annovi, A; The ATLAS collaboration; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Piendibene, M; Sacco, I; Sartori, L; Tripiccione, R

    2010-01-01

    We propose a new generation of VLSI processor for pattern recognition based on Associative Memory architecture, optimized for on-line track finding in high-energy physics experiments. We describe the architecture, the technology studies and the prototype design of a new R&D Associative Memory project: it maximizes the pattern density on ASICs, minimizes the power consumption and improves the functionality for the Fast Tracker (FTK) proposed to upgrade the ATLAS trigger at LHC. Finally we will focus on possible future applications inside and outside High Energy Physics (HEP).

  7. Highly scalable linear solvers on thousands of processors.

    Energy Technology Data Exchange (ETDEWEB)

    Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  8. Asymptotic teleportation scheme as a universal programmable quantum processor.

    Science.gov (United States)

    Ishizaka, Satoshi; Hiroshima, Tohya

    2008-12-12

    We consider a scheme of quantum teleportation where a receiver has multiple (N) output ports and obtains the teleported state by merely selecting one of the N ports according to the outcome of the sender's measurement. We demonstrate that such teleportation is possible by showing an explicit protocol where N pairs of maximally entangled qubits are employed. The optimal measurement performed by the sender is the square-root measurement, and perfect teleportation fidelity is achieved asymptotically in the large-N limit. Such asymptotic teleportation can be utilized as a universal programmable processor.

  9. Negative base encoding in optical linear algebra processors

    Science.gov (United States)

    Perlee, C.; Casasent, D.

    1986-01-01

    In the digital multiplication by analog convolution algorithm, the bits of two encoded numbers are convolved to form the product of the two numbers in mixed binary representation; this output can be easily converted to binary. Attention is presently given to negative base encoding, treating base -2 initially, and then showing that the negative base system can be readily extended to any radix. In general, negative base encoding in optical linear algebra processors represents a more efficient technique than either sign magnitude or 2's complement encoding, when the additions of digitally encoded products are performed in parallel.
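
    The encoding and convolution steps described in this abstract can be illustrated in software. The following is a minimal sketch, not the authors' optical implementation: the function names are hypothetical, and an ordinary nested loop stands in for the analog convolution. It encodes integers in base -2, convolves the digit sequences, and evaluates the resulting mixed-representation digits (which may exceed 1) to recover the product. No separate sign handling is needed, because the negative base represents negative numbers directly.

```python
def to_negabinary(n: int) -> list[int]:
    """Return the base -2 digits of n, least significant digit first."""
    if n == 0:
        return [0]
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:  # Python returns r in {0, -1} here; shift it into {0, 1}
            n, r = n + 1, r + 2
        digits.append(r)
    return digits


def convolve_digits(a: list[int], b: list[int]) -> list[int]:
    """Discrete convolution of two digit sequences (stand-in for the analog step)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def from_digits(digits: list[int], base: int = -2) -> int:
    """Evaluate a digit sequence; digits may be larger than 1 (mixed representation)."""
    return sum(d * base**i for i, d in enumerate(digits))


def multiply_by_convolution(x: int, y: int) -> int:
    """Multiply two signed integers via convolution of their base -2 digits."""
    mixed = convolve_digits(to_negabinary(x), to_negabinary(y))
    return from_digits(mixed)
```

    For example, `multiply_by_convolution(-7, 5)` encodes both operands, convolves their digits, and evaluates the weighted sum of the convolution output, giving the same result as `-7 * 5`.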

  10. Implementing High Performance Lexical Analyzer using CELL Broadband Engine Processor

    Directory of Open Access Journals (Sweden)

    P.J.SATHISH KUMAR

    2011-09-01

    The lexical analyzer is the first phase of the compiler and commonly the most time consuming. The compilation of large programs is still far from optimized in today’s compilers. With modern processors moving more towards improving parallelization and multithreading, older compilers can no longer gain performance as technology advances. Any multicore architecture relies more on improving parallelism than on improving single-core performance. A compiler that is completely parallel and optimized is yet to be developed and would require significant effort to create. On careful analysis we find that the performance of a compiler is significantly affected by the lexical analyzer’s scanning and tokenizing phases. This effort is directed towards the creation of a completely parallelized lexical analyzer designed to run on the Cell/B.E. processor that utilizes its multicore functionalities to achieve high performance gains in a compiler. Each SPE reads a block of data from the input and tokenizes it independently. To avoid dependences between SPEs, a scheme for dynamically extending static block limits is incorporated. Each SPE is given a range which it initially scans, and it then dynamically finalizes its input buffer to a set of complete tokens from that range. This ensures that the SPEs run in parallel, independently and dynamically, with the PPE scheduling the load for each SPE. The initially static assignment of the code blocks is made dynamic as soon as one SPE commits. This aids SPE load distribution and balancing. To maintain the order of execution, the PPE holds the output buffer until all SPEs of a single stage commit and move to the next stage, and only then writes it out to the file. The approach can be extended easily to other multicore architectures as well. Tokenization is performed by high-speed string searching against the keyword dictionary of the language, using the Aho-Corasick algorithm.
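
    The block-limit extension described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' Cell/B.E. code: worker threads stand in for the SPEs, the ordered collection of results stands in for the PPE output buffer, tokens are assumed to be whitespace-delimited, and `str.split` replaces the Aho-Corasick keyword matching. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_on_token_boundaries(text: str, n_blocks: int) -> list[str]:
    """Cut text into roughly equal blocks, extending each static limit so that
    no whitespace-delimited token straddles two blocks."""
    size = max(1, len(text) // n_blocks)
    blocks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        # Extend the static block limit until it falls on whitespace or end of input.
        while end < len(text) and not text[end].isspace():
            end += 1
        blocks.append(text[start:end])
        start = end
    return blocks


def parallel_tokenize(text: str, n_workers: int = 4) -> list[str]:
    """Tokenize blocks independently and merge the results in input order."""
    blocks = split_on_token_boundaries(text, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in submission order, which preserves
        # the original token order regardless of which worker finishes first.
        per_block = list(pool.map(str.split, blocks))
    return [token for tokens in per_block for token in tokens]
```

    Because every block ends on a token boundary, the merged output matches a sequential scan of the same input. The number of blocks actually produced may differ slightly from `n_workers`, since limits are moved to the nearest boundary.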

  11. Robotic architectures

    CSIR Research Space (South Africa)

    Mtshali, M

    2010-01-01

    In the development of mobile robotic systems, a robotic architecture plays a crucial role in interconnecting all the sub-systems and controlling the system. The design of robotic architectures for mobile autonomous robots is a challenging...

  12. Robotic Architectures

    Directory of Open Access Journals (Sweden)

    Mbali Mtshali

    2010-01-01

    In the development of mobile robotic systems, a robotic architecture plays a crucial role in interconnecting all the sub-systems and controlling the system. The design of robotic architectures for mobile autonomous robots is a challenging and complex task. With a number of existing architectures and tools to choose from, a review of the existing robotic architectures is essential. This paper surveys the different paradigms in robotic architectures. A classification of the existing robotic architectures and a comparison of the attributes and properties of different proposals have been carried out. The paper also provides a view on the current state of designing robot architectures. It also proposes a conceptual model of a generalised robotic architecture for mobile autonomous robots. Defence Science Journal, 2010, 60(1), pp. 15-22, DOI: http://dx.doi.org/10.14429/dsj.60.96

  13. Energy-efficient communication processors design and implementation for emerging wireless systems

    CERN Document Server

    Fasthuber, Robert; Raghavan, Praveen; Naessens, Frederik

    2013-01-01

    This book describes a new design approach for energy-efficient, Domain-Specific Instruction set Processor (DSIP) architectures for the wireless baseband domain. The innovative techniques presented enable co-design of algorithms, architectures and technology, for efficient implementation of the most advanced technologies. To demonstrate the feasibility of the author’s design approach, case studies are included for crucial functionality of advanced wireless systems with increased computational performance, flexibility and reusability. Designers using this approach will benefit from reduced development/product costs and greater scalability to future process technology nodes. Describes a DSIP architecture explicitly for the wireless domain, significantly more efficient than methods commonly in use; Includes an efficient DSIP architecture template, which can be reused for specific designs; Uses holistic design approach, considering all relevant requirements and combining many innovative/disruptive design concept...

  14. Architecture & Environment

    Science.gov (United States)

    Erickson, Mary; Delahunt, Michael

    2010-01-01

    Most art teachers would agree that architecture is an important form of visual art, but they do not always include it in their curriculums. In this article, the authors share core ideas from "Architecture and Environment," a teaching resource that they developed out of a long-term interest in teaching architecture and their fascination with the…

  16. A project of universal computing platform - cluster of floating point DSP processors (Projekt uniwersalnej platformy obliczeniowej - klastra zmiennoprzecinkowych procesorów DSP)

    CERN Document Server

    Dymanowski, L; Linczuk, M

    2009-01-01

    In this paper, a project to build a cluster of DSP processors is presented. The project is realized as an extension board for PC computers. A block diagram of the board is described, together with the properties of the DSP processor that are relevant to cluster computation. The aim is to use a number of such boards to build a cluster of DSP clusters. This architecture will be used to process results from High Energy Physics experiments such as CMS, ILC and E-XFEL.

  17. Multi-Rate Secure Processor Terminal Architecture Study. Volume 1. Terminal Architecture.

    Science.gov (United States)

    1981-06-01

    drift. In this method the weights of the equalizer are monitored to detect lateral motion due to symbol timing drift. A VCO, controlling the receiver... Terminal Controller is contained in Volume II (classified) of this report. 3.5 Mechanical Packaging Concept The mechanical packaging approach for the... The primary I/O method intended for the HMSP is direct memory access (DMA). A polite form of DMA is utilized which incurs no overhead. This is

  18. Bounded budgeted parallel architecture versus control dominated architecture for hazard data-signal processor synthesis

    Science.gov (United States)

    Le Gal, Bertrand; Casseau, Emmanuel; Martin, Eric

    2005-06-01

    Multimedia applications such as video and image processing are often characterized by a large number of data accesses (i.e. RAM accesses). In many digital signal-processing applications, the array access patterns are regular and periodic. In these cases, optimized Pipelined Memory Access Controllers can be generated. This technique is used to improve the pipelined access mode to RAM by creating specialized hardware components for generating addresses and for packing and unpacking data items. In this paper we focus on the design, implementation and validation of memory interfacing modules that can be automatically generated from a behavioural synthesis tool and which can efficiently handle predictable address patterns as well as unpredictable ones (dynamic address computations) in a pipelined manner. We also analyze the benefits of balancing dynamic address computations between the datapath and specialized computation units placed in the memory controller, and of optimizing the bit-width of operators and data locality, i.e. reducing the power consumption.
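
    The distinction drawn in this abstract between predictable and dynamic address patterns can be illustrated in software. The sketch below is only a model, not the generated hardware: the function names and parameters are hypothetical. The first generator produces a regular pattern that depends only on loop counters and could be computed ahead of the data, while the second depends on values read at run time.

```python
from typing import Iterable, Iterator


def window_addresses(base: int, width: int, rows: int, cols: int,
                     row0: int = 0, col0: int = 0) -> Iterator[int]:
    """Predictable pattern: row-major addresses of a rows x cols window
    inside an image stored with `width` words per line."""
    for r in range(rows):
        for c in range(cols):
            yield base + (row0 + r) * width + (col0 + c)


def indirect_addresses(base: int, indices: Iterable[int]) -> Iterator[int]:
    """Dynamic pattern: each address depends on a data value (an index)
    that is only known once it has been read or computed."""
    for index in indices:
        yield base + index
```

    For the regular case, the whole address sequence is known from `base`, `width` and the window geometry, which is what makes a dedicated address generator possible. In the indirect case, each address must wait for its index.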

  19. IDSP- INTERACTIVE DIGITAL SIGNAL PROCESSOR

    Science.gov (United States)

    Mish, W. H.

    1994-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on the various algorithms commonly used for digital signal analysis work. The processing of a digital time series to extract information is usually achieved by the application of a number of fairly standard operations. However, it is often desirable to "experiment" with various operations and combinations of operations to explore their effect on the results. IDSP is designed to provide an interactive and easy-to-use system for this type of digital time series analysis. The IDSP operators can be applied in any sensible order (even recursively), and can be applied to single time series or to simultaneous time series. IDSP is being used extensively to process data obtained from scientific instruments onboard spacecraft. It is also an excellent teaching tool for demonstrating the application of time series operators to artificially-generated signals. IDSP currently includes over 43 standard operators. Processing operators provide for Fourier transformation operations, design and application of digital filters, and Eigenvalue analysis. Additional support operators provide for data editing, display of information, graphical output, and batch operation. User-developed operators can be easily interfaced with the system to provide for expansion and experimentation. Each operator application generates one or more output files from an input file. The processing of a file can involve many operators in a complex application. IDSP maintains historical information as an integral part of each file so that the user can display the operator history of the file at any time during an interactive analysis. IDSP is written in VAX FORTRAN 77 for interactive or batch execution and has been implemented on a DEC VAX-11/780 operating under VMS. The IDSP system generates graphics output for a variety of graphics systems. The program requires the use of Versaplot and Template plotting

  20. THOR Field and Wave Processor - FWP

    Science.gov (United States)

    Soucek, Jan; Rothkaehl, Hanna; Balikhin, Michael; Zaslavsky, Arnaud; Nakamura, Rumi; Khotyaintsev, Yuri; Uhlir, Ludek; Lan, Radek; Yearby, Keith; Morawski, Marek; Winkler, Marek

    2016-04-01

    If selected, Turbulence Heating ObserveR (THOR) will become the first mission ever flown in space dedicated to plasma turbulence. The Fields and Waves Processor (FWP) is an integrated electronics unit for all electromagnetic field measurements performed by THOR. FWP will interface with all field sensors (the electric field antennas of the EFI instrument, the MAG fluxgate magnetometer and the search-coil magnetometer (SCM)) and perform data digitization and on-board processing. The FWP box will house multiple data acquisition sub-units and signal analyzers, all sharing a common power supply and data processing unit and thus a single data and power interface to the spacecraft. Integrating all the electromagnetic field measurements in a single unit will improve the consistency of field measurement and the accuracy of time synchronization. The feasibility of making highly sensitive electric and magnetic field measurements in space has been demonstrated by Cluster (among other spacecraft), and THOR instrumentation, complemented by a thorough electromagnetic cleanliness program, will further improve on this heritage. Taking advantage of the capabilities of modern electronics and of the large telemetry bandwidth of THOR, FWP will provide simultaneous synchronized waveform and spectral data products at high time resolution from the numerous THOR sensors. FWP will also implement a plasma resonance sounder and a digital plasma quasi-thermal noise analyzer designed to provide high cadence measurements of plasma density and temperature complementary to data from particle instruments. FWP will be interfaced with the particle instrument data processing unit (PPU) via a dedicated digital link, which will enable on-board correlation between waves and particles, quantifying the transfer of energy between them. The FWP instrument shall be designed and built by an international consortium of scientific institutes from Czech Republic, Poland, France, UK, Sweden