WorldWideScience

Sample records for monolithic parallel processor

  1. Fast, Massively Parallel Data Processors

    Science.gov (United States)

    Heaton, Robert A.; Blevins, Donald W.; Davis, ED

    1994-01-01

    Proposed fast, massively parallel data processor contains 8x16 array of processing elements with efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on "X" interconnection grid with external memory via high-capacity input/output bus. This approach to conditional operation nearly doubles speed of various arithmetic operations.

  2. Ultrafast Fourier-transform parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    Greenberg, W.L.

    1980-04-01

    A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

  3. Adapting implicit methods to parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Reeves, L.; McMillin, B.; Okunbor, D.; Riggins, D. [Univ. of Missouri, Rolla, MO (United States)

    1994-12-31

    When numerically solving many types of partial differential equations, it is advantageous to use implicit methods because of their better stability and more flexible parameter choice (e.g., larger time steps). However, since implicit methods usually require simultaneous knowledge of the entire computational domain, these methods are difficult to implement directly on distributed memory parallel processors. This leads to infrequent use of implicit methods on parallel/distributed systems. The usual implementation of implicit methods is inefficient due to the nature of parallel systems, where it is common to distribute the grid points of the computational domain over the processors so as to maintain a relatively even workload per processor. This creates a problem at the locations in the domain where adjacent points are not on the same processor. In order for the values at these points to be calculated, messages have to be exchanged between the corresponding processors. Without special adaptation, this results in idle processors during part of the computation, and as the number of idle processors increases, the effective speedup gained from the parallel processor decreases.
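The boundary problem described above is commonly handled with ghost ("halo") cells that mirror a neighbor's boundary values before each sweep. A minimal sketch, with plain Python lists standing in for distributed memory and the `halo_exchange` step standing in for message passing (all names are illustrative, not from the paper):

```python
def partition(grid, nproc):
    """Split a 1-D grid into nproc contiguous chunks, each padded with ghost cells."""
    n = len(grid)
    chunk = n // nproc
    parts = []
    for p in range(nproc):
        lo, hi = p * chunk, (p + 1) * chunk if p < nproc - 1 else n
        # [left ghost] + owned values + [right ghost]
        parts.append([0.0] + list(grid[lo:hi]) + [0.0])
    return parts

def halo_exchange(parts):
    """Copy each subdomain's boundary values into its neighbors' ghost cells.
    On a real machine this is the message exchange between processors."""
    for p in range(len(parts) - 1):
        parts[p][-1] = parts[p + 1][1]   # right ghost <- neighbor's first owned value
        parts[p + 1][0] = parts[p][-2]   # neighbor's left ghost <- our last owned value

def jacobi_step(parts):
    """One Jacobi-like sweep: each owned point averages its two neighbors."""
    halo_exchange(parts)
    for part in parts:
        new = part[:]
        for i in range(1, len(part) - 1):
            new[i] = 0.5 * (part[i - 1] + part[i + 1])
        part[1:-1] = new[1:-1]

grid = [0.0] * 8 + [1.0] * 8   # a step profile to be smoothed
parts = partition(grid, 4)
for _ in range(10):
    jacobi_step(parts)
flat = [x for part in parts for x in part[1:-1]]
```

Without the exchange, the sweep at subdomain boundaries would read stale ghost values, which is exactly the idle-processor/synchronization cost the abstract discusses.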

  4. Broadband monitoring simulation with massively parallel processors

    Science.gov (United States)

    Trubetskov, Mikhail; Amotchkina, Tatiana; Tikhonravov, Alexander

    2011-09-01

    Modern efficient optimization techniques, namely needle optimization and gradual evolution, enable one to design optical coatings of any type. Moreover, these techniques make it possible to obtain multiple solutions with close spectral characteristics. It is important, therefore, to develop software tools that help one choose a practically optimal solution from a wide variety of possible theoretical designs. A practically optimal solution provides the highest production yield when the optical coating is manufactured. Computational manufacturing is a low-cost tool for choosing a practically optimal solution. Probability theory predicts that reliable production-yield estimations require many hundreds or even thousands of computational manufacturing experiments; as a result, reliable estimation of the production yield may require too much computational time. The most time-consuming operation is calculation of the discrepancy function used by a broadband monitoring algorithm. This function is formed by a sum of terms over a wavelength grid. These terms can be computed simultaneously in different threads of computation, which opens great opportunities for parallelization. Multi-core and multi-processor systems can provide speedups of up to several times. Additional potential for further acceleration lies in the use of Graphics Processing Units (GPUs). A modern GPU consists of hundreds of massively parallel processors and is capable of performing floating-point operations efficiently.
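Since the discrepancy function is a sum of independent per-wavelength terms, its parallelization can be sketched generically: split the wavelength grid into chunks and sum each chunk in its own worker. This is an illustrative toy, not the authors' software; `term` here is a simple squared residual, and note that in CPython real CPU-bound speedups would need processes or a GPU rather than the GIL-bound threads used for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

def term(measured, computed):
    """One per-wavelength term of the discrepancy (squared residual)."""
    return (measured - computed) ** 2

def discrepancy(measured, computed, workers=4):
    """Sum the per-wavelength terms, one chunk of the grid per worker."""
    n = len(measured)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    def partial(b):
        lo, hi = b
        return sum(term(measured[i], computed[i]) for i in range(lo, hi))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial, bounds))

# Toy transmittance data on an 8-point wavelength grid
measured = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
computed = [1.0, 0.8, 0.6, 0.6, 0.5, 0.5, 0.3, 0.0]
```

The per-chunk sums are associative, so the parallel result matches the serial one, which is what makes the GPU mapping mentioned in the abstract straightforward.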

  5. Speed Scaling on Parallel Processors with Migration

    CERN Document Server

    Angel, Eric; Kacem, Fadi; Letsios, Dimitrios

    2011-01-01

    We study the problem of scheduling a set of jobs with release dates, deadlines and processing requirements (or works) on parallel speed-scaled processors so as to minimize the total energy consumption. We consider that both preemption and migration of jobs are allowed. An exact polynomial-time algorithm has been proposed for this problem, which is based on the Ellipsoid algorithm. Here, we formulate the problem as a convex program and we propose a simpler polynomial-time combinatorial algorithm which is based on a reduction to the maximum flow problem. Our algorithm runs in $O(n f(n) \log P)$ time, where $n$ is the number of jobs, $P$ is the range of all possible values of processors' speeds divided by the desired accuracy, and $f(n)$ is the complexity of computing a maximum flow in a layered graph with $O(n)$ vertices. Independently, Albers et al. \cite{AAG11} proposed an $O(n^2 f(n))$-time algorithm exploiting the same relation with the maximum flow problem. We extend our algorithm to the multiprocessor speed scal...

  6. Programming massively parallel processors a hands-on approach

    CERN Document Server

    Kirk, David B

    2010-01-01

    Programming Massively Parallel Processors discusses basic concepts of parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVIDIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...

  7. Processor Allocation for Optimistic Parallelization of Irregular Programs

    CERN Document Server

    Versaci, Francesco

    2012-01-01

    Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this pro...

  8. Fast Parallel Computation of Polynomials Using Few Processors

    DEFF Research Database (Denmark)

    Valiant, Leslie G.; Skyum, Sven; Berkowitz, S.;

    1983-01-01

    It is shown that any multivariate polynomial of degree $d$ that can be computed sequentially in $C$ steps can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.

  9. Fast parallel computation of polynomials using few processors

    DEFF Research Database (Denmark)

    Valiant, Leslie; Skyum, Sven

    1981-01-01

    It is shown that any multivariate polynomial that can be computed sequentially in $C$ steps and has degree $d$ can be computed in parallel in $O((\log d)(\log C + \log d))$ steps using only $(Cd)^{O(1)}$ processors.

  10. SLAPP: A systolic linear algebra parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    Drake, B.L.; Luk, F.T.; Speiser, J.M.; Symanski, J.J. (Naval Ocean Systems Center and Cornell Univ.)

    1987-07-01

    Systolic array computer architectures provide a means for fast computation of the linear algebra algorithms that form the building blocks of many signal-processing algorithms, facilitating their real-time computation. For applications to signal processing, the systolic array operates on matrices, an inherently parallel view of the data, using numerical linear algebra algorithms that have been suitably parallelized to efficiently utilize the available hardware. This article describes work currently underway at the Naval Ocean Systems Center, San Diego, California, to build a two-dimensional systolic array, SLAPP, demonstrating efficient and modular parallelization of key matrix computations for real-time signal- and image-processing problems.
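The time-stepped behavior of such a two-dimensional systolic array can be illustrated with a small functional simulation: for C = A·B, cell (i, j) receives A[i][k] and B[k][j] at step t = i + j + k as the skewed operand streams flow across the array. This is a behavioral model only, not the SLAPP hardware:

```python
def systolic_matmul(A, B):
    """Simulate an n-by-n systolic array computing C = A @ B.
    A streams left-to-right, B top-to-bottom; the skew means cell (i, j)
    multiplies A[i][k] with B[k][j] at global time step t = i + j + k."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # total latency of the wavefront
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operand pair reaches cell (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

The simulation makes the parallelism visible: at each step t, every cell on the anti-diagonal wavefront does one multiply-accumulate concurrently, which is why the array finishes in O(n) steps instead of O(n^3).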

  11. Parallel processor simulator for multiple optic channel architectures

    Science.gov (United States)

    Wailes, Tom S.; Meyer, David G.

    1992-12-01

    A parallel processing architecture based on multiple channel optical communication is described and compared with existing interconnection strategies for parallel computers. The proposed multiple channel architecture (MCA) uses MQW-DBR lasers to provide a large number of independent, selectable channels (or virtual buses) for data transport. Arbitrary interconnection patterns as well as machine partitions can be emulated via appropriate channel assignments. Hierarchies of parallel architectures and simultaneous execution of parallel tasks are also possible. Described are a basic overview of the proposed architecture, various channel allocation strategies that can be utilized by the MCA, and a summary of advantages of the MCA compared with traditional interconnection techniques. Also described is a comprehensive multiple processor simulator that has been developed to execute parallel algorithms using the MCA as a data transport mechanism between processors and memory units. Simulation results -- including average channel load, effective channel utilization, and average network latency for different algorithms and different transmission speeds -- are also presented.

  12. Global synchronization of parallel processors using clock pulse width modulation

    Science.gov (United States)

    Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

    2013-04-02

    A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and performs a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.

  13. Scientific programming on massively parallel processor CP-PACS

    Energy Technology Data Exchange (ETDEWEB)

    Boku, Taisuke [Tsukuba Univ., Ibaraki (Japan). Inst. of Information Sciences and Electronics

    1998-03-01

    The massively parallel processor CP-PACS targets a wide range of problems in computational physics, and its architecture was designed to handle diverse numerical workloads. This report outlines the CP-PACS, gives a programming example using the Kernel CG benchmark from NAS Parallel Benchmarks version 1, and describes the two major features of the architecture: the pseudo vector processing mechanism and the parallel-processing tuning of scientific and technical computation using the three-dimensional hyper-crossbar network. The processing units (PUs) of the CP-PACS are based on a RISC processor augmented with a pseudo vector processor; pseudo vector processing is realized as loop processing with scalar instructions. The features of the network connecting the PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part of the main loop that takes the most processing time is the matrix-vector product (matvec), and the parallel processing of the matvec is explained, along with how the CPU computation time is determined. As performance evaluation, the execution time, the short-vector processing of the pseudo vector processor based on a slide window, and comparisons with other parallel computers are reported. (K.I.)
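The matvec parallelization mentioned in the abstract is, in its generic form, a row-block decomposition: each processor computes the dot products for its own block of rows. A sketch of that textbook decomposition follows; it is not the CP-PACS implementation, which additionally relies on pseudo vector processing and the hyper-crossbar network.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_matvec(A, x, workers=4):
    """y = A @ x with the rows of A partitioned among workers."""
    n = len(A)
    def rows(block):
        lo, hi = block
        return [sum(a * b for a, b in zip(A[i], x)) for i in range(lo, hi)]
    # Contiguous row blocks; empty blocks are harmless when workers > n
    blocks = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(rows, blocks))
    return [v for part in parts for v in part]
```

In CG, this matvec dominates each iteration, so its parallel efficiency largely determines the benchmark result the abstract discusses.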

  14. APEmille a parallel processor in the teraflop range

    CERN Document Server

    Panizzi, E

    1996-01-01

    APEmille is a SIMD parallel processor under development at the Italian National Institute for Nuclear Physics (INFN). APEmille is very well suited for Lattice QCD applications, both for its hardware characteristics and for its software and language features. APEmille is an array of custom arithmetic processors arranged on a three-dimensional torus. The replicated processor is a pipelined VLIW device performing integer and single/double precision IEEE floating point operations. The processor is optimized for complex computations and has a peak performance of 528 Mflops at 66 MHz and of 800 Mflops at 100 MHz. In principle an array of 2048 nodes is able to break the Tflops barrier. A powerful programming language named TAO is provided and is highly optimized for QCD. A C++ compiler is foreseen. Specific data structures, operators and even statements can be defined by the user for each different application. Effort has been made to define the language constructs for QCD.

  15. Parallel Processor for 3D Recovery from Optical Flow

    Directory of Open Access Journals (Sweden)

    Jose Hugo Barron-Zambrano

    2009-01-01

    3D recovery from motion has received major attention in computer vision systems in recent years. The main problem lies in the number of operations and memory accesses to be performed by the majority of the existing techniques when translated to hardware or software implementations. This paper proposes a parallel processor for 3D recovery from optical flow. Its main features are the maximum reuse of data and the low number of clock cycles needed to calculate the optical flow, along with the precision with which 3D recovery is achieved. The results of the proposed architecture as well as those from processor synthesis are presented.

  16. Directions in parallel processor architecture, and GPUs too

    CERN Document Server

    CERN. Geneva

    2014-01-01

    Modern computing is power-limited in every domain of computing. Performance increments extracted from instruction-level parallelism (ILP) are no longer power-efficient; they haven't been for some time. Thread-level parallelism (TLP) is a more easily exploited form of parallelism, at the expense of programmer effort to expose it in the program. In this talk, I will introduce you to disparate topics in parallel processor architecture that will impact programming models (and you) in both the near and far future. About the speaker Olivier is a senior GPU (SM) architect at NVIDIA and an active participant in the concurrency working group of the ISO C++ committee. He has also worked on very large diesel engines as a mechanical engineer, and taught at McGill University (Canada) as a faculty instructor.

  17. QPACE -- a QCD parallel computer based on Cell processors

    CERN Document Server

    Baier, H; Drochner, M; Eicker, N; Fischer, U; Fodor, Z; Frommer, A; Gomez, C; Goldrian, G; Heybrock, S; Hierl, D; Hüsken, M; Huth, T; Krill, B; Lauritsen, J; Lippert, T; Maurer, T; Meyer, N; Nobile, A; Ouda, I; Pivanti, M; Pleiter, D; Schäfer, A; Schick, H; Schifano, F; Simma, H; Solbrig, S; Streuer, T; Sulanke, K -H; Tripiccione, R; Vogt, J -S; Wettig, T; Winter, F

    2009-01-01

    QPACE is a novel parallel computer which has been developed to be primarily used for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor that is used in the Playstation 3. The QPACE nodes are interconnected by a custom, application optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack a new water cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.

  18. Parallel information transfer in a multinode quantum information processor.

    Science.gov (United States)

    Borneman, T W; Granade, C E; Cory, D G

    2012-04-06

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel.

  19. Distributing and Scheduling Divisible Task on Parallel Communicating Processors

    Institute of Scientific and Technical Information of China (English)

    Li Guodong; Zhang Defu

    2002-01-01

    In this paper we propose a novel scheme for scheduling divisible tasks on parallel processors connected by a system interconnection network with arbitrary topology. A divisible task is a computation that can be divided into arbitrarily many independent subtasks solved in parallel. Our model takes into consideration communication start-up time and communication delays between processors. Moreover, by constructing the corresponding Network Spanning Tree (NST) for a network, our scheme can be applied to all kinds of network topologies. We present the concept of the Balanced Task Distribution Tree and use it to design the Equation Set Creation Algorithm, in which a set of linear equations is created by traversing the NST in post-order. After solving the created equations, we obtain the optimal task assignment. Experiments confirm the applicability of our scheme in real-life situations.
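For intuition, the equation-set idea can be shown on the simplest topology: a star with sequential one-round load distribution, where the condition that all workers finish simultaneously yields one linear equation per adjacent pair of workers. This is textbook divisible load theory, not the paper's NST-based algorithm; the per-unit compute times `w` and link costs `z` below are illustrative.

```python
def optimal_fractions(w, z):
    """Load fractions alpha[i] that make all workers finish simultaneously.
    Equal-finish-time condition between workers i and i+1:
        alpha[i] * w[i] = alpha[i+1] * (z[i+1] + w[i+1])
    solved as a recurrence, then normalized so the fractions sum to 1."""
    n = len(w)
    alpha = [1.0]
    for i in range(n - 1):
        alpha.append(alpha[i] * w[i] / (z[i + 1] + w[i + 1]))
    total = sum(alpha)
    return [a / total for a in alpha]

def finish_times(alpha, w, z):
    """Completion time of each worker when load is sent out sequentially."""
    t_comm = 0.0
    times = []
    for a, wi, zi in zip(alpha, w, z):
        t_comm += a * zi              # worker waits for all earlier transmissions
        times.append(t_comm + a * wi)
    return times
```

The paper's contribution generalizes this closing-equations construction from a star to arbitrary topologies via the post-order traversal of the NST.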

  20. Parallel BDD-based monolithic approach for acoustic fluid-structure interaction

    Science.gov (United States)

    Minami, Satsuki; Kawai, Hiroshi; Yoshimura, Shinobu

    2012-12-01

    Parallel BDD-based monolithic algorithms for acoustic fluid-structure interaction problems are developed. In a previous study, two schemes, NN-I + CGC-FULL and NN-I + CGC-DIAG, have been proven to be efficient among several BDD-type schemes for one processor. Thus, the parallelization of these schemes is discussed in the present study. These BDD-type schemes consist of the operations of the Schur complement matrix-vector (Sv) product, Neumann-Neumann (NN) preconditioning, and the coarse problem. In the present study, the Sv product and NN preconditioning are parallelized for both schemes, and the parallel implementation of the solid and fluid parts of the coarse problem is considered for NN-I + CGC-DIAG. The results of numerical experiments indicate that both schemes exhibit performances that are almost as good as those of single solid and fluid analyses in the Sv product and NN preconditioning. Moreover, NN-I + CGC-DIAG appears to become more efficient as the problem size becomes large due to the parallel calculation of the coarse problem.

  1. On program restructuring, scheduling, and communication for parallel processor systems

    Energy Technology Data Exchange (ETDEWEB)

    Polychronopoulos, Constantine D.

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed with a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler, was used to transform programs into parallel form and conduct experiments. Two new program restructuring techniques are presented: loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm, and its performance is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.

  2. An informal introduction to program transformation and parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, since my previous use of computers was as a classroom demonstration tool.

  3. Monolithic Parallel Tandem Organic Photovoltaic Cell with Transparent Carbon Nanotube Interlayer

    Science.gov (United States)

    Tanaka, S.; Mielczarek, K.; Ovalle-Robles, R.; Wang, B.; Hsu, D.; Zakhidov, A. A.

    2009-01-01

    We demonstrate an organic photovoltaic cell with a monolithic tandem structure in parallel connection. Transparent multiwalled carbon nanotube sheets are used as an interlayer anode electrode for this parallel tandem. The characteristics of front and back cells are measured independently. The short circuit current density of the parallel tandem cell is larger than the currents of each individual cell. The wavelength dependence of photocurrent for the parallel tandem cell shows the superposition spectrum of the two spectral sensitivities of the front and back cells. The monolithic three-electrode photovoltaic cell indeed operates as a parallel tandem with improved efficiency.

  5. Optimal Parallel Algorithms for Two Processor Scheduling with Tree Precedence Constraints

    OpenAIRE

    Ernst W. Mayr; Hans Stadtherr

    2016-01-01

    Consider the problem of finding a minimum length schedule for n unit execution time tasks on m processors with tree-like precedence constraints. A sequential algorithm can solve this problem in linear time. The fastest known parallel algorithm needs O(log n) time using n^2 processors. For the case m=2 we present two work-optimal parallel algorithms that produce greedy optimal schedules for intrees and outtrees. Both run in O(log n) time using n/(log n) processors of an EREW PRAM.
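The greedy schedules in question are level schedules in the style of Hu's algorithm: at each step, run the (at most m) ready tasks with the greatest distance to the root. A sequential sketch for intrees follows; the paper's actual contribution, the O(log n)-time EREW PRAM construction, is not reproduced here.

```python
def hu_schedule(parents, m):
    """Greedy highest-level-first schedule for unit-time tasks with intree
    precedence (parents[v] is v's parent, None for the root).
    Returns a list of timesteps, each a list of at most m tasks."""
    n = len(parents)
    level = [0] * n              # level = number of nodes on the path to the root
    def get_level(v):
        if level[v] == 0:
            level[v] = 1 if parents[v] is None else 1 + get_level(parents[v])
        return level[v]
    for v in range(n):
        get_level(v)
    pending_children = [0] * n   # a task is ready once all its children ran
    for v in range(n):
        if parents[v] is not None:
            pending_children[parents[v]] += 1
    done = [False] * n
    schedule = []
    remaining = n
    while remaining:
        ready = [v for v in range(n) if not done[v] and pending_children[v] == 0]
        step = sorted(ready, key=lambda v: -level[v])[:m]
        for v in step:
            done[v] = True
            remaining -= 1
            if parents[v] is not None:
                pending_children[parents[v]] -= 1
        schedule.append(step)
    return schedule
```

For unit-time intree instances this greedy rule is known to be optimal on m processors, which is the property the parallel algorithms exploit.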

  6. An iterative expanding and shrinking process for processor allocation in mixed-parallel workflow scheduling.

    Science.gov (United States)

    Huang, Kuo-Chan; Wu, Wei-Ya; Wang, Feng-Jian; Liu, Hsiao-Ching; Hung, Chun-Hao

    2016-01-01

    Parallel computation has been widely applied in a variety of large-scale scientific and engineering applications. Many studies indicate that exploiting both task and data parallelism, i.e. mixed-parallel workflows, to solve large computational problems achieves better efficiency than either pure task parallelism or pure data parallelism. Scheduling traditional workflows of pure task parallelism on parallel systems has long been known to be an NP-complete problem. Mixed-parallel workflow scheduling has to deal with an additional challenging issue of processor allocation. In this paper, we explore the processor allocation issue in scheduling mixed-parallel workflows of moldable tasks, called M-tasks, and propose an Iterative Allocation Expanding and Shrinking (IAES) approach. Compared to previous approaches, our IAES has two distinguishing features. The first is allocating more processors to the tasks on allocated critical paths for effectively reducing the makespan of workflow execution. The second is allowing the processor allocation of an M-task to shrink during the iterative procedure, resulting in a more flexible and effective process for finding better allocations. The proposed IAES approach has been evaluated with a series of simulation experiments and compared to several well-known previous methods, including CPR, CPA, MCPA, and MCPA2. The experimental results indicate that our IAES approach outperforms those previous methods significantly in most situations, especially when nodes of the same layer in a workflow have unequal workloads.

  7. JIST: Just-In-Time Scheduling Translation for Parallel Processors

    Directory of Open Access Journals (Sweden)

    Giovanni Agosta

    2005-01-01

    The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power consumption and reduced hardware cost. Dynamic compilation makes it possible to bridge the gap between the two technologies, but special attention must be paid to software instruction scheduling, a must for VLIW architectures. We have implemented JIST, a virtual machine and JIT compiler for Java bytecode targeted to a VLIW processor. We show the impact of various optimizations on the performance of code compiled with JIST through an experimental study on a set of benchmark programs. We report significant speedups, and increases in the number of instructions issued per cycle of up to 50% with respect to the non-scheduling version of the JIT compiler. Further optimizations are discussed.

  8. PERFORMANCE EVALUATION OF OR1200 PROCESSOR WITH EVOLUTIONARY PARALLEL HPRC USING GEP

    Directory of Open Access Journals (Sweden)

    R. Maheswari

    2012-04-01

    In this fast computing era, most embedded systems require more computing power to complete complex functions/tasks in less time. One way to achieve this is by boosting processor performance, which allows the processor core to run faster. This paper presents a novel technique for increasing performance through parallel HPRC (High Performance Reconfigurable Computing) in the CPU/DSP (Digital Signal Processor) unit of the OR1200 (OpenRISC 1200) using Gene Expression Programming (GEP), an evolutionary programming model. OR1200 is a soft-core RISC processor, available as an Intellectual Property core, that can efficiently run any modern operating system. A parallel HPRC is placed internally in the integer execution pipeline unit of the CPU/DSP core to increase performance. The GEP parallel HPRC is activated/deactivated by triggering the signals (i) HPRC_Gene_Start and (ii) HPRC_Gene_End. A Verilog HDL (Hardware Description Language) functional model of the Gene Expression Programming parallel HPRC is developed and synthesised using Xilinx ISE in the former part of the work, and the CoreMark processor benchmark is used to test the performance of the OR1200 soft core in the latter part. The results of the implementation show an overall speed-up of 20.59% from the GEP-based parallel HPRC in the execution unit of the OR1200.

  9. Adaptive data-driven parallelization of multi-view video coding on multi-core processor

    Institute of Scientific and Technical Information of China (English)

    PANG Yi; HU WeiDong; SUN LiFeng; YANG ShiQiang

    2009-01-01

    Multi-view video coding (MVC) comprises rich 3D information and is widely used in new visual media, such as 3DTV and free viewpoint TV (FTV). However, even with mainstream computer manufacturers migrating to multi-core processors, the huge computational requirement of MVC currently prohibits its wide use in consumer markets. In this paper, we demonstrate the design and implementation of the first parallel MVC system on the Cell Broadband Engine processor, a state-of-the-art multi-core processor. We propose a task-dispatching algorithm which is adaptively data-driven at the frame level for MVC, and implement a parallel multi-view video decoder with a modified H.264/AVC codec on a real machine. This approach provides scalable speedup (up to 16 times on sixteen cores) through proper local store management, utilization of code locality and SIMD improvement. Decoding speed, speedup and utilization rate of cores are presented in experimental results.

  11. Parallel computing of discrete element method on multi-core processors

    Institute of Scientific and Technical Information of China (English)

    Yusuke Shigeto; Mikio Sakai

    2011-01-01

    This paper describes parallel simulation techniques for the discrete element method (DEM) on multi-core processors. Recently, multi-core CPU and GPU processors have attracted much attention in accelerating computer simulations in various fields. We propose a new algorithm for multi-thread parallel computation of DEM, which makes effective use of the available memory and accelerates the computation. This study shows that memory usage is drastically reduced by using this algorithm. To show the practical use of DEM in industry, a large-scale powder system is simulated with a complicated drive unit. We compared the performance of the simulation between the latest GPU and CPU processors with optimized programs for each processor. The results show that the difference in performance is not substantial when using either GPUs or CPUs with a multi-thread parallel algorithm. In addition, the DEM algorithm is shown to have high scalability in a multi-thread parallel computation on a CPU.
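
    A minimal sketch of the multi-thread idea, assuming a 1D toy contact model: positions are shared read-only and each worker writes forces only for its own slice of particles, so no locking is needed. The real DEM code is far more elaborate and runs native threads; this only illustrates the partitioning.

```python
from concurrent.futures import ThreadPoolExecutor

def spring_forces_chunk(pos, radius, start, stop):
    """1D overlap-spring forces on particles [start, stop); each worker
    touches only its own output slice, so workers never contend."""
    k = 100.0  # illustrative contact stiffness
    out = []
    for i in range(start, stop):
        f = 0.0
        for j, pj in enumerate(pos):
            if j == i:
                continue
            d = pos[i] - pj
            overlap = 2 * radius - abs(d)
            if overlap > 0:  # particles in contact
                f += k * overlap * (1 if d > 0 else -1)
        out.append(f)
    return out

def spring_forces(pos, radius, workers=4):
    """Split the particle index range evenly across `workers` threads."""
    n = len(pos)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]
    with ThreadPoolExecutor(workers) as ex:
        chunks = ex.map(lambda b: spring_forces_chunk(pos, radius, *b), bounds)
    return [f for chunk in chunks for f in chunk]
```

For example, two overlapping particles receive equal and opposite forces regardless of how the index range is split among threads.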

  12. A full parallel radix sorting algorithm for multicore processors

    OpenAIRE

    Maus, Arne

    2011-01-01

    The problem addressed in this paper is that we want to sort an integer array a[] of length n on a multi-core machine with k cores. Amdahl’s law tells us that the inherent sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation of that algorithm. This paper introduces PARL, a parallel left radix sorting algorithm for use on ordinary shared memory multi-core machines, that has just one simple statement in its sequential part. It can be seen a...
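
    The Amdahl's-law bound invoked above is simple to evaluate; a quick sketch (the serial fractions below are hypothetical, chosen only to show how even a tiny sequential part caps the speedup):

```python
def amdahl_speedup(serial_fraction, cores):
    """Amdahl's law: speedup on `cores` cores when `serial_fraction`
    of the work is inherently sequential."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# With just 1% sequential work, 32 cores deliver well under 32x,
# and no core count can beat 1/0.01 = 100x.
print(round(amdahl_speedup(0.01, 32), 2))    # → 24.43
print(round(amdahl_speedup(0.01, 1024), 2))  # → 91.18
```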

  13. Technology Development and Circuit Design for a Parallel Laser Programmable Floating Point Application Specific Processor

    Science.gov (United States)

    1989-12-01

    The laser programmable floating point application specific processor (LPASP) is a new approach at rapid development of custom VLSI chips. The LPASP...double precision floating point adder and the laser programmable read-only memory (LPROM) that are macrocells within the LPASP. In addition, the...thesis analyzes the applicability of an LPASP parallel processing system. The double precision floating point adder is an adder/subtractor macrocell

  14. Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors

    OpenAIRE

    Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent

    2013-01-01

    Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. ...

  15. Parallel Algorithms for Medical Informatics on Data-Parallel Many-Core Processors

    OpenAIRE

    Moazeni, Maryam

    2013-01-01

    The extensive use of medical monitoring devices has resulted in the generation of tremendous amounts of data. Storage, retrieval, and analysis of such data require platforms that can scale with data growth and adapt to the various behavior of the analysis and processing algorithms. In recent years, many-core processors and more specifically many-core Graphical Processing Units (GPUs) have become one of the most promising platforms for high performance processing of data, due to the massive pa...

  16. Data flow analysis of a highly parallel processor for a level 1 pixel trigger

    Energy Technology Data Exchange (ETDEWEB)

    Cancelo, G. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Gottschalk, Erik Edward [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Pavlicek, V. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wang, M. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Wu, J. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States)

    2003-01-01

    The present work describes the architecture and data flow analysis of a highly parallel processor for the Level 1 Pixel Trigger for the BTeV experiment at Fermilab. First the Level 1 Trigger system is described. Then the major components are analyzed by resorting to mathematical modeling. Also, behavioral simulations are used to confirm the models. Results from modeling and simulations are fed back into the system in order to improve the architecture, eliminate bottlenecks, allocate sufficient buffering between processes and obtain other important design parameters. An interesting feature of the current analysis is that the models can be extended to a large class of architectures and parallel systems.

  17. DIALIGN P: Fast pair-wise and multiple sequence alignment using parallel processors

    Directory of Open Access Journals (Sweden)

    Kaufmann Michael

    2004-09-01

    Full Text Available Abstract Background Parallel computing is frequently used to speed up computationally expensive tasks in Bioinformatics. Results Herein, a parallel version of the multi-alignment program DIALIGN is introduced. We propose two ways of dividing the program into independent sub-routines that can be run on different processors: (a) pair-wise sequence alignments that are used as a first step to multiple alignment account for most of the CPU time in DIALIGN. Since alignments of different sequence pairs are completely independent of each other, they can be distributed to multiple processors without any effect on the resulting output alignments. (b) For alignments of large genomic sequences, we use a heuristic by splitting up sequences into sub-sequences based on a previously introduced anchored alignment procedure. For our test sequences, this combined approach reduces the program running time of DIALIGN by up to 97%. Conclusions By distributing sub-routines to multiple processors, the running time of DIALIGN can be crucially improved. With these improvements, it is possible to apply the program in large-scale genomics and proteomics projects that were previously beyond its scope.
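
    Because the pairwise alignments in step (a) are fully independent, they map naturally onto a pool of worker processes. A minimal sketch; `score_pair` is a toy stand-in (longest common prefix), not DIALIGN's alignment routine:

```python
from itertools import combinations
from multiprocessing import Pool

def score_pair(pair):
    """Hypothetical stand-in for a pairwise alignment: returns the pair
    and the length of its longest common prefix as a 'score'."""
    a, b = pair
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return (a, b), n

if __name__ == "__main__":
    seqs = ["GATTACA", "GATTTT", "GAGA", "TACA"]
    pairs = list(combinations(seqs, 2))        # n*(n-1)/2 independent tasks
    with Pool(processes=4) as pool:            # farm them out to processors
        results = pool.map(score_pair, pairs)  # output order is preserved
    for (a, b), s in results:
        print(a, b, s)
```

Since the tasks share no state, the parallel result is identical to running `map(score_pair, pairs)` serially, which is exactly why this step can be distributed without affecting the resulting alignments.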

  18. Parallel Information Transfer in a Multi-Node Quantum Information Processor

    CERN Document Server

    Borneman, Troy W; Cory, David G

    2011-01-01

    We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel.

  19. Fast structural design and analysis via hybrid domain decomposition on massively parallel processors

    Science.gov (United States)

    Farhat, Charbel

    1993-01-01

    A hybrid domain decomposition framework for static, transient and eigen finite element analyses of structural mechanics problems is presented. Its basic ingredients include physical substructuring and/or automatic mesh partitioning, mapping algorithms, 'gluing' approximations for fast design modifications and evaluations, and fast direct and preconditioned iterative solvers for local and interface subproblems. The overall methodology is illustrated with the structural design of a solar viewing payload that is scheduled to fly in March 1993. This payload has been entirely designed and validated by a group of undergraduate students at the University of Colorado using the proposed hybrid domain decomposition approach on a massively parallel processor. Performance results are reported on the CRAY Y-MP/8 and the iPSC-860/64 Touchstone systems, which represent two extremes of parallel architecture. The hybrid domain decomposition methodology is shown to outperform leading solution algorithms and to exhibit excellent parallel scalability.

  20. Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2017-08-01

    For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPUs), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward understanding these processors and the new developments to port the Kalman filter to NVIDIA GPUs.

  2. Compiling the functional data-parallel language SaC for Microgrids of Self-Adaptive Virtual Processors

    NARCIS (Netherlands)

    Grelck, C.; Herhut, S.; Jesshope, C.; Joslin, C.; Lankamp, M.; Scholz, S.-B.; Shafarenko, A.

    2009-01-01

    We present preliminary results from compiling the high-level, functional and data-parallel programming language SaC into a novel multi-core design: Microgrids of Self-Adaptive Virtual Processors (SVPs). The side-effect free nature of SaC in conjunction with its data-parallel foundation make it an id

  3. Computing effective properties of random heterogeneous materials on heterogeneous parallel processors

    Science.gov (United States)

    Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

    2012-11-01

    In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performance and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel version features near-linear speed-up when using only the CPU cores, and executes more than 20 times faster when additionally using the GPU.

  4. A High-Efficiency Monolithic DC-DC PFM Boost Converter with Parallel Power MOS Technique

    Directory of Open Access Journals (Sweden)

    Hou-Ming Chen

    2013-01-01

    Full Text Available This paper presents a high-efficiency monolithic dc-dc PFM boost converter designed with a standard TSMC 3.3/5V 0.35 μm CMOS technology. The proposed boost converter combines the parallel power MOS technique with pulse-frequency modulation (PFM) to achieve high efficiency over a wide load current range, extending battery life and reducing cost for portable systems. The proposed parallel power MOS controller and load current detector exactly determine the size of the power MOS to increase power conversion efficiency at different loads. Post-layout simulation results of the designed circuit show that the power conversion efficiency is 74.9–90.7% over a load range from 1 mA to 420 mA with a 1.5 V supply. Moreover, the proposed boost converter has a smaller area and lower cost than those of existing boost converter circuits.

  5. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    Directory of Open Access Journals (Sweden)

    Matthew O'keefe

    1995-01-01

    Full Text Available Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  6. Preliminary design of an advanced programmable digital filter network for large passive acoustic ASW systems. [Parallel processor

    Energy Technology Data Exchange (ETDEWEB)

    McWilliams, T.; Widdoes, Jr., L. C.; Wood, L.

    1976-09-30

    The design of an extremely high performance programmable digital filter of novel architecture, the LLL Programmable Digital Filter, is described. The digital filter is a high-performance multiprocessor having general purpose applicability and high programmability; it is extremely cost effective either in a uniprocessor or a multiprocessor configuration. The architecture and instruction set of the individual processor was optimized with regard to the multiple processor configuration. The optimal structure of a parallel processing system was determined for addressing the specific Navy application centering on the advanced digital filtering of passive acoustic ASW data of the type obtained from the SOSUS net. 148 figures. (RWR)

  7. Development of radiation tolerant monolithic active pixel sensors with fast column parallel read-out

    Science.gov (United States)

    Koziel, M.; Dorokhov, A.; Fontaine, J.-C.; De Masi, R.; Winter, M.

    2010-12-01

    Monolithic active pixel sensors (MAPS) [1] (Turchetta et al., 2001) are being developed at IPHC—Strasbourg to equip the EUDET telescope [2] (Haas, 2006) and vertex detectors for future high energy physics experiments, including the STAR upgrade at RHIC [3] (T.S. Collaboration, 2005) and the CBM experiment at FAIR/GSI [4] (Heuser, 2006). High granularity, low material budget and high read-out speed are systematically required for most applications, complemented, for some of them, with high radiation tolerance. A specific column-parallel architecture, implemented in the MIMOSA-22 sensor, was developed to achieve fast read-out MAPS. Previous studies of the front-end architecture integrated in this sensor, which includes in-pixel amplification, have shown that the fixed pattern noise increase consecutive to ionizing radiation can be controlled by means of a negative feedback [5] (Hu-Guo et al., 2008). However, an unexpected rise of the temporal noise was observed. A second version of this chip (MIMOSA-22bis) was produced in order to search for possible improvements of the radiation tolerance, regarding this type of noise. In this prototype, the feedback transistor was tuned in order to mitigate the sensitivity of the pixel to ionizing radiation. The performances of the pixels after irradiation were investigated for two types of feedback transistors: enclosed layout transistor (ELT) [6] (Snoeys et al., 2000) and "standard" transistor with either large or small transconductance. The noise performance of all test structures was studied in various conditions (expected in future experiments) regarding temperature, integration time and ionizing radiation dose. Test results are presented in this paper. Based on these observations, ideas for further improvement of the radiation tolerance of column parallel MAPS are derived.

  8. Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems

    Science.gov (United States)

    Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

    2012-01-01

    The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

  10. Performance evaluation of the HEP, ELXSI and CRAY X-MP parallel processors on hydrocode test problems

    Energy Technology Data Exchange (ETDEWEB)

    Liebrock, L.M.; McGrath, J.F.; Hicks, D.L.

    1986-07-07

    Parallel programming promises improved processing speeds for hydrocodes, magnetohydrocodes, multiphase flow codes, thermal-hydraulics codes, wavecodes and other continuum dynamics codes. This paper presents the results of some investigations of parallel algorithms on three parallel processors: the CRAY X-MP, ELXSI and the HEP computers. Introduction and Background: We report the results of investigations of parallel algorithms for computational continuum dynamics. These programs (hydrocodes, wavecodes, etc.) produce simulations of the solutions to problems arising in the motion of continua: solid dynamics, liquid dynamics, gas dynamics, plasma dynamics, multiphase flow dynamics, thermal-hydraulic dynamics and multimaterial flow dynamics. This report restricts its scope to one-dimensional algorithms such as the von Neumann-Richtmyer (1950) scheme.

  11. Choosing processor array configuration by performance modeling for a highly parallel linear algebra algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Littlefield, R.J.; Maschhoff, K.J.

    1991-04-01

    Many linear algebra algorithms utilize an array of processors across which matrices are distributed. Given a particular matrix size and a maximum number of processors, what configuration of processors, i.e., what size and shape array, will execute the fastest? The answer to this question depends on tradeoffs between load balancing, communication startup and transfer costs, and computational overhead. In this paper we analyze in detail one algorithm: the blocked factored Jacobi method for solving dense eigensystems. A performance model is developed to predict execution time as a function of the processor array and matrix sizes, plus the basic computation and communication speeds of the underlying computer system. In experiments on a large hypercube (up to 512 processors), this model has been found to be highly accurate (mean error approximately 2%) over a wide range of matrix sizes (10 × 10 through 200 × 200) and processor counts (1 to 512). The model reveals, and direct experiment confirms, that the tradeoffs mentioned above can be surprisingly complex and counterintuitive. We propose decision procedures based directly on the performance model to choose configurations for fastest execution. The model-based decision procedures are compared to a heuristic strategy and shown to be significantly better. 7 refs., 8 figs., 1 tab.
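
    A decision procedure of the kind proposed can be sketched generically: enumerate candidate array shapes and pick the one minimizing a modeled execution time. The cost model below is illustrative only, not the paper's calibrated model for the blocked factored Jacobi method:

```python
def candidate_grids(max_procs):
    """All p x q processor grids with p * q <= max_procs."""
    for p in range(1, max_procs + 1):
        for q in range(1, max_procs // p + 1):
            yield p, q

def modeled_time(n, p, q, flop=1e-8, alpha=1e-4, beta=1e-7):
    """Toy model: O(n^3) work split over p*q processors, plus message
    startup (alpha) and transfer (beta) terms that grow with the grid."""
    compute = flop * n ** 3 / (p * q)
    comm = alpha * (p + q) + beta * n * n / min(p, q)
    return compute + comm

def best_grid(n, max_procs):
    """Choose the shape with the smallest predicted execution time."""
    return min(candidate_grids(max_procs), key=lambda pq: modeled_time(n, *pq))

print(best_grid(200, 64))  # the model may prefer a non-square or smaller grid
```

As the abstract notes, the winning shape can be counterintuitive: with the toy parameters above, the minimum is not always the full square grid.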

  12. Fast space-filling molecular graphics using dynamic partitioning among parallel processors.

    Science.gov (United States)

    Gertner, B J; Whitnell, R M; Wilson, K R

    1991-09-01

    We present a novel algorithm for the efficient generation of high-quality space-filling molecular graphics that is particularly appropriate for the creation of the large number of images needed in the animation of molecular dynamics. Each atom of the molecule is represented by a sphere of an appropriate radius, and the image of the sphere is constructed pixel-by-pixel using a generalization of the lighting model proposed by Porter (Comp. Graphics 1978, 12, 282). The edges of the spheres are antialiased, and intersections between spheres are handled through a simple blending algorithm that provides very smooth edges. We have implemented this algorithm on a multiprocessor computer using a procedure that dynamically repartitions the effort among the processors based on the CPU time used by each processor to create the previous image. This dynamic reallocation among processors automatically maximizes efficiency in the face of both the changing nature of the image from frame to frame and the shifting demands of the other programs running simultaneously on the same processors. We present data showing the efficiency of this multiprocessing algorithm as the number of processors is increased. The combination of the graphics and multiprocessor algorithms allows the fast generation of many high-quality images.
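
    The dynamic repartitioning can be sketched directly: resize each processor's share of the image rows in proportion to the speed it demonstrated on the previous frame. The function below is illustrative, not the authors' implementation:

```python
def repartition(rows_per_proc, cpu_times, total_rows):
    """Assign new row counts proportional to each processor's observed
    rows-per-second on the last frame, so all processors finish the
    next frame at roughly the same time."""
    per_row_cost = [t / max(r, 1) for t, r in zip(cpu_times, rows_per_proc)]
    speed = [1.0 / c for c in per_row_cost]        # estimated rows/second
    total_speed = sum(speed)
    new_rows = [int(s / total_speed * total_rows) for s in speed]
    new_rows[-1] += total_rows - sum(new_rows)     # absorb rounding remainder
    return new_rows

# A processor that took twice as long per row gets about half the rows.
print(repartition([100, 100], [2.0, 1.0], 200))  # → [66, 134]
```

Because the estimate is refreshed every frame, the split automatically tracks both the changing image content and competing load from other programs on the same processors.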

  13. Performance of parallel linear algebra algorithms on an interleaved array processor

    Energy Technology Data Exchange (ETDEWEB)

    Wolf, G.

    1986-01-01

    This thesis investigates the performance of numerical algorithms that take advantage of the unique properties of a new type of array processor. This processor, called an Interleaved Array Processor (IAP), is characterized by its ability to use multiple Programmable Functional Units (PFU's) with local data and instruction memories. Unlike conventional array processors, which can only execute simple arithmetic vector operations such as addition and multiplication, the IAP can execute complex vector operations defined by the user. These operations are specified by small programs that can contain conditional branching as well as arithmetic and data movement instructions in each processor. The author calls these programs High-Level Vector Operations (HLVO's). Ways to partition the algorithms and the data among the processing units in the system are presented such that the computation performed in every processing unit is increased while, at the same time, the data movement on the system bus is reduced. In this way the bus can be timeshared among several functional units, allowing several operations on different vector components to be executed simultaneously and overlapped with the transfer of operands and results.

  14. Parallel Performance of MPI Sorting Algorithms on Dual-Core Processor Windows-Based Systems

    CERN Document Server

    Elnashar, Alaa Ismail

    2011-01-01

    Message Passing Interface (MPI) is widely used to implement parallel programs. Although Windows-based architectures provide the facilities of parallel execution and multi-threading, little attention has been focused on using MPI on these platforms. In this paper we use a dual-core Windows-based platform to study the effect of the number of parallel processes, and also the number of cores, on the performance of three MPI parallel implementations of some sorting algorithms.

  15. A Parallel FPGA Implementation for Real-Time 2D Pixel Clustering for the ATLAS Fast TracKer Processor

    CERN Document Server

    Sotiropoulou, C-L; The ATLAS collaboration; Annovi, A; Beretta, M; Kordas, K; Nikolaidis, S; Petridou, C; Volpi, G

    2014-01-01

    The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from the inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. ...
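
    A software analogue of the clustering step can be sketched in a few lines: group adjacent hit pixels and report one centroid per cluster, which shows both purposes (data reduction and best spatial measurement). The real FTK implementation is a moving-window scheme in FPGA logic; the flood-fill below is only an illustration:

```python
def cluster_hits(hits):
    """Group hit pixels (col, row) into 8-connected clusters and return
    one centroid per cluster -- many hits in, few coordinates out."""
    hits = set(hits)
    clusters = []
    while hits:
        stack = [hits.pop()]
        cluster = []
        while stack:
            c, r = stack.pop()
            cluster.append((c, r))
            for dc in (-1, 0, 1):        # visit the 8 neighbouring pixels
                for dr in (-1, 0, 1):
                    n = (c + dc, r + dr)
                    if n in hits:
                        hits.remove(n)
                        stack.append(n)
        cx = sum(c for c, _ in cluster) / len(cluster)
        cy = sum(r for _, r in cluster) / len(cluster)
        clusters.append((cx, cy))
    return clusters

# Two touching pixels merge into one cluster; the isolated pixel is its own.
print(sorted(cluster_hits([(0, 0), (0, 1), (5, 5)])))  # → [(0.0, 0.5), (5.0, 5.0)]
```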

  17. A data parallel digitizer for a time-based simulation of CMOS Monolithic Active Pixel Sensors with FairRoot

    Science.gov (United States)

    Sitzmann, P.; Amar-Youcef, S.; Doering, D.; Deveaux, M.; Fröhlich, I.; Koziel, M.; Krebs, E.; Linnik, B.; Michel, J.; Milanovic, B.; Müntz, C.; Li, Q.; Stroth, J.; Tischler, T.

    2014-06-01

    CMOS Monolithic Active Pixel Sensors (MAPS) have demonstrated excellent performance in the field of charged particle tracking. They feature an excellent single-point resolution of a few μm and a light material budget of 0.05% X0, in combination with good radiation tolerance and time resolution. This makes the sensors a valuable technology for the micro vertex detectors (MVD) of various experiments in heavy-ion and particle physics, such as STAR and CBM. State-of-the-art MAPS are equipped with a rolling shutter readout; therefore, the data of one individual event is typically found in more than one data train generated by the sensor. This paper presents a concept to introduce this feature in both simulation and data analysis, taking advantage of the sensor topology of the MVD. This topology allows the use of massively parallel data streaming and handling strategies within the FairRoot framework.

  18. Dedicated optoelectronic stochastic parallel processor for real-time image processing: motion-detection demonstration and design of a hybrid complementary-metal-oxide semiconductor- self-electro-optic-device-based prototype.

    Science.gov (United States)

    Cassinelli, A; Chavel, P; Desmulliez, M P

    2001-12-10

    We report experimental results and performance analysis of a dedicated optoelectronic processor that implements stochastic optimization-based image-processing tasks in real time. We first show experimental results using a proof-of-principle-prototype demonstrator based on standard silicon-complementary-metal-oxide-semiconductor (CMOS) technology and liquid-crystal spatial light modulators. We then elaborate on the advantages of using a hybrid CMOS-self-electro-optic-device-based smart-pixel array to monolithically integrate photodetectors and modulators on the same chip, providing compact, high-bandwidth intrachip optoelectronic interconnects. We have modeled the operation of the monolithic processor, clearly showing system-performance improvement.

  19. Low-power, real-time digital video stabilization using the HyperX parallel processor

    Science.gov (United States)

    Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

    2011-06-01

    Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air/ground vehicles that focuses on significantly reducing size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full color video stream with a 53 ms frame latency using a single 100-DSP-core HyperX hx3100TM processor running at less than 3 W power draw. By comparison, an Intel Core2 Duo processor running the same base algorithm on a 320x240, 15 fps stream consumes on the order of 18 W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP-based platform. In addition, the implementation requires only a minimal number of components to interface directly to the imaging sensor and helmet-mounted display, and the same computing architecture can be used to generate software-defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature-based algorithm (an 11 x 11 Difference of Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames, and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera/platform motion on a frame-by-frame basis.

  20. Design of a Content Addressable Memory-based Parallel Processor implementing (−1+j)-based Binary Number System

    Directory of Open Access Journals (Sweden)

    Tariq Jamil

    2014-11-01

    Full Text Available Contrary to the traditional base-2 binary number system used in today's computers, in which a complex number is represented by two separate binary entities, one for the real part and one for the imaginary part, the Complex Binary Number System (CBNS), a binary number system with base (−1+j), represents a given complex number in a single binary string format. In this paper, CBNS is reviewed and arithmetic algorithms for this number system are presented. The design of a CBNS-based parallel processor utilizing content-addressable memory for implementation of the associative dataflow concept is described, and software-related issues are also explained.
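
    The single-string encoding described above can be sketched in a few lines. The digit-extraction rule below (repeatedly divide by −1+j, emitting a 0/1 digit chosen by parity) is a standard property of this number system, not code from the paper:

```python
# Sketch: converting a Gaussian integer a+bj to the (-1+j)-based Complex
# Binary Number System (CBNS), in which one bit string encodes both the
# real and imaginary parts.  Illustrative only, not the paper's design.

def to_cbns(a, b):
    """Return the CBNS digit string (MSB first) for the Gaussian integer a+bj."""
    digits = []
    while a != 0 or b != 0:
        d = (a + b) % 2          # digit chosen so that (a-d)+bj is divisible by -1+j
        a -= d
        # divide a+bj by -1+j:  (a+bj)/(-1+j) = ((b-a) + (-a-b)j) / 2
        a, b = (b - a) // 2, (-a - b) // 2
        digits.append(d)
    return "".join(str(d) for d in reversed(digits)) or "0"

def from_cbns(bits):
    """Evaluate a CBNS digit string back to a complex number (Horner scheme)."""
    value = 0
    for bit in bits:
        value = value * (-1 + 1j) + int(bit)
    return value
```

    For example, −1 encodes as the single string "11101", since (−1+j)⁴ + (−1+j)³ + (−1+j)² + 1 = −4 + (2+2j) + (−2j) + 1 = −1.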

  1. Kemari: A Portable High Performance Fortran System for Distributed Memory Parallel Processors

    Directory of Open Access Journals (Sweden)

    T. Kamachi

    1997-01-01

    Full Text Available We have developed a compilation system which extends High Performance Fortran (HPF) in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives both give additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of message-passing interface (MPI) primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.

  2. A two-stage flexible flow-shop scheduling problem with m identical parallel machines on one stage and a batch processor on the other stage

    Institute of Scientific and Technical Information of China (English)

    HE Long-min; SUN Shi-jie; CHENG Ming-bao

    2008-01-01

    This paper considers a hybrid two-stage flow-shop scheduling problem with m identical parallel machines on one stage and a batch processor on the other stage. The processing time of job Jj on any of the m identical parallel machines is aj ≡ a (j ∈ N), and the processing time of job Jj on the batch processor M is bj (j ∈ N). We take the makespan (Cmax) as our minimization objective. In this paper, for the problem FSMP-BI (m identical parallel machines on the first stage and a batch processor on the second stage), based on the algorithm given by Sung and Choung for the problem 1|rj, BI|Cmax under the constraint of a given processing sequence, we develop an optimal dynamic programming Algorithm H1 that runs in max{O(n log n), O(nB)} time. A symmetric Algorithm H2 with the same max{O(n log n), O(nB)} time bound is then given for the problem BI-FSMP (a batch processor on the first stage and m identical parallel machines on the second stage).
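
    The dynamic-programming idea for the second (batch) stage can be illustrated as follows. This is a hypothetical simplified reading of the Algorithm H1 structure (jobs batched in arrival order, batch capacity B, burn-in batch time equal to the largest bj in the batch), not the authors' algorithm verbatim:

```python
# Sketch: jobs leave the m identical stage-1 machines at release times
# r_j = a*ceil(j/m); the batch machine then groups at most B consecutive jobs,
# and a batch's processing time is the maximum b_j inside it (burn-in model).
# O(nB) DP over "first i jobs done", assuming a fixed arrival-order sequence.

import math

def fsmp_bi_makespan(a, b, m, B):
    """Min makespan when jobs are batched in arrival order (b: batch-stage times)."""
    n = len(b)
    r = [a * math.ceil((j + 1) / m) for j in range(n)]   # stage-1 completion times
    INF = float("inf")
    f = [INF] * (n + 1)      # f[i] = min makespan for the first i jobs
    f[0] = 0
    for i in range(1, n + 1):
        bmax = 0
        for k in range(1, min(i, B) + 1):                # last batch = jobs i-k+1 .. i
            bmax = max(bmax, b[i - k])                   # burn-in batch time
            start = max(f[i - k], r[i - 1])              # wait for machine and last job
            f[i] = min(f[i], start + bmax)
    return f[n]
```

    With a = 1, m = 2, b = [1, 1] and B = 2, both jobs leave stage 1 at time 1 and run as one batch, giving a makespan of 2.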

  3. Trajectory optimization for real-time guidance. I - Time-varying LQR on a parallel processor

    Science.gov (United States)

    Psiaki, Mark L.; Park, Kihong

    1990-01-01

    A key algorithmic element of a real-time trajectory optimization hardware/software implementation, the quadratic program (QP) solver element, is presented. The purpose of the effort is to make nonlinear trajectory optimization fast enough to provide real-time commands during guidance of a vehicle such as an aeromaneuvering orbiter. Many methods of nonlinear programming require the solution of a QP at each iteration. In the trajectory optimization case the QP has a special dynamic programming structure, an LQR-like structure. QP algorithm speed is increased by taking advantage of this special structure and by parallel implementation.
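
    The LQR-like structure such a QP solver exploits can be sketched with a scalar time-varying Riccati sweep (a toy illustration only; the paper's parallel implementation is far more elaborate):

```python
# Sketch: backward Riccati recursion for a scalar time-varying system
# x[k+1] = A[k]x[k] + B[k]u[k] with stage cost Q[k]x^2 + R[k]u^2 and
# terminal weight QN.  This O(N) sweep is the dynamic-programming
# structure that makes trajectory QPs cheap to solve.

def lqr_gains(A, B, Q, R, QN):
    """Backward sweep: return feedback gains K[k] with u[k] = -K[k]*x[k]."""
    N = len(A)
    P = QN                      # terminal cost-to-go weight
    K = [0.0] * N
    for k in reversed(range(N)):
        K[k] = (B[k] * P * A[k]) / (R[k] + B[k] * P * B[k])
        P = Q[k] + A[k] * P * (A[k] - B[k] * K[k])      # Riccati recursion
    return K

def rollout(x0, A, B, K):
    """Simulate the closed loop and return the state trajectory."""
    xs = [x0]
    for k in range(len(A)):
        u = -K[k] * xs[-1]
        xs.append(A[k] * xs[-1] + B[k] * u)
    return xs
```

    For constant A = B = Q = R = 1 the gains converge to the golden-ratio value K ≈ 0.618, and the closed loop drives the state to zero.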

  4. Optimizing ion channel models using a parallel genetic algorithm on graphical processors.

    Science.gov (United States)

    Ben-Shalom, Roy; Aviv, Amit; Razon, Benjamin; Korngreen, Alon

    2012-01-01

    We have recently shown that we can semi-automatically constrain models of voltage-gated ion channels by combining a stochastic search algorithm with ionic currents measured using multiple voltage-clamp protocols. Although numerically successful, this approach is highly demanding computationally, with optimization on a high-performance Linux cluster typically lasting several days. To solve this computational bottleneck we converted our optimization algorithm to run on a graphics processing unit (GPU) using NVIDIA's CUDA. Parallelizing the process on a Fermi graphic computing engine from NVIDIA increased the speed ∼180 times over an application running on an 80-node Linux cluster, considerably reducing simulation times. This application allows users to optimize models for ion channel kinetics on a single, inexpensive, desktop "super computer," greatly reducing the time and cost of building models relevant to neuronal physiology. We also demonstrate that the point of algorithm parallelization is crucial to its performance. We substantially reduced computing time by solving the ODEs (ordinary differential equations) on the GPU so as to massively reduce memory transfers to and from it. This approach may be applied to speed up other data-intensive applications requiring iterative solutions of ODEs.
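
    The overall search loop can be sketched with a minimal genetic-style optimizer. The model and fitness below are illustrative stand-ins for the ion-channel kinetics, and the serial inner loop marks exactly the per-individual work a GPU would evaluate in parallel:

```python
# Sketch: stochastic (genetic-style) search fitting one rate parameter of a
# toy exponential-relaxation "channel" model to a target trace.  The fitness
# evaluation of each population member is independent, which is what makes
# the per-individual GPU parallelization described above effective.

import math
import random

random.seed(1)

def model(p, xs):
    """Toy response: 1 - exp(-p*x), a stand-in for the real kinetics."""
    return [1.0 - math.exp(-p * x) for x in xs]

def fitness(p, xs, target):
    """Negative squared error against the target trace (higher is better)."""
    return -sum((m - t) ** 2 for m, t in zip(model(p, xs), target))

def evolve(xs, target, pop_size=40, generations=60):
    pop = [random.uniform(0.0, 5.0) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: fitness(p, xs, target), reverse=True)
        elite = pop[: pop_size // 4]                          # selection (elitist)
        pop = elite + [random.choice(elite) + random.gauss(0.0, 0.1)
                       for _ in range(pop_size - len(elite))]  # mutation
    return max(pop, key=lambda p: fitness(p, xs, target))
```

    Fitting a trace generated with a true rate of 2.0 recovers a value close to 2.0 within the mutation noise.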

  5. A Design of a New Column-Parallel Analog-to-Digital Converter Flash for Monolithic Active Pixel Sensor.

    Science.gov (United States)

    Chakir, Mostafa; Akhamal, Hicham; Qjidaa, Hassan

    2017-01-01

    The CMOS Monolithic Active Pixel Sensor (MAPS) for the International Linear Collider (ILC) vertex detector (VXD) imposes stringent requirements on its analog readout electronics, specifically on the analog-to-digital converter (ADC). This paper concerns designing and optimizing a new architecture for a low-power, high-speed, and small-area 4-bit column-parallel flash ADC. Later in this study, we propose to interpose an S/H block in the converter. This integration of the S/H block increases the sensitivity of the converter to very small amplitudes of the input signal from the sensor and provides sufficient time for the converter to code the input signal. The ADC is developed in a 0.18 μm CMOS process with a pixel pitch of 35 μm. The proposed ADC meets the power dissipation, size, and speed constraints of a MAPS composed of a matrix of 64 rows and 48 columns, where each column ADC covers a small area of 35 × 336.76 μm². The proposed ADC consumes low power at a 1.8 V supply and a 100 MS/s sampling rate with a dynamic range of 125 mV. Its DNL and INL are 0.0812/−0.0787 LSB and 0.0811/−0.0787 LSB, respectively. Furthermore, this ADC achieves a high speed of more than 5 GHz.
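
    The flash-conversion principle behind such a column-parallel converter can be modelled behaviourally in a few lines. The full-scale voltage below matches the stated 125 mV dynamic range, but the ideal comparators and resistor ladder are assumptions, not the 0.18 μm circuit:

```python
# Sketch: an N-bit flash ADC uses 2^N - 1 comparators against a resistor-ladder
# reference; the resulting thermometer code is then encoded to binary.
# All comparators fire simultaneously, which is what gives flash converters
# their single-cycle conversion speed.

def flash_adc(vin, vref=0.125, bits=4):
    """Ideal N-bit flash conversion of vin over [0, vref)."""
    levels = 2 ** bits
    lsb = vref / levels
    thermometer = [vin >= (i + 1) * lsb for i in range(levels - 1)]
    return sum(thermometer)   # number of tripped comparators = output code
```

    With the assumed 125 mV full scale, the LSB is about 7.8 mV, so an input of 62.5 mV lands on code 8 of 15.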

  6. A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2) for Solving Traveling Salesman Problem for Multi-Core Processors using SPC3 Programming Model

    Directory of Open Access Journals (Sweden)

    Muhammad Ali Ismail

    2011-08-01

    Full Text Available With the arrival of multi-cores, every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of on-going research on the design of a new parallel programming model for multi-core processors. In this paper we present a combined parallel and concurrent implementation of the Lin-Kernighan Heuristic (LKH-2) for solving the Travelling Salesman Problem (TSP) using a newly developed parallel programming model, SPC3 PM, for general-purpose multi-core processors. This implementation is found to be very simple, highly efficient, scalable, and less time consuming compared to the existing serial LKH-2 implementations in a multi-core processing environment. We have tested our parallel implementation of LKH-2 with medium and large TSP instances from TSPLIB. For all these tests our proposed approach has shown much improved performance and scalability.

  7. Design and Realization of High Performance Parallel FFT Processor

    Institute of Scientific and Technical Information of China (English)

    石长振; 杨雪; 王贞松

    2012-01-01

    This paper proposes a design for a high-performance parallel Fast Fourier Transform (FFT) processor. It uses four butterfly units for parallel processing and an improved conflict-free memory addressing scheme, so that 16 data words can be read, processed, and written back simultaneously in one cycle. An FPGA implementation of the processor is given; performance evaluation results show that the high-performance parallel FFT processor is superior to other FFT processors and can meet practical application needs.
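
    The butterfly-level parallelism such a processor exploits can be seen in a plain iterative radix-2 FFT: every butterfly within a stage is independent, so four butterfly units can retire four of them per cycle. A reference sketch, not the FPGA datapath:

```python
# Sketch: iterative radix-2 decimation-in-time FFT.  Within each stage, the
# inner butterflies are mutually independent, which is the parallelism a
# multi-butterfly-unit processor exploits (here they simply run in a loop).

import cmath

def fft(x):
    """Iterative radix-2 DIT FFT; len(x) must be a power of two."""
    n = len(x)
    # bit-reversal permutation of the input
    a = [x[int(format(i, "0{}b".format(n.bit_length() - 1))[::-1], 2)]
         for i in range(n)]
    span = 1
    while span < n:
        w = cmath.exp(-1j * cmath.pi / span)      # twiddle step for this stage
        for start in range(0, n, 2 * span):       # independent butterfly groups
            t = 1.0
            for k in range(start, start + span):  # these butterflies are parallel in HW
                u, v = a[k], a[k + span] * t
                a[k], a[k + span] = u + v, u - v
                t *= w
        span *= 2
    return a
```

    The result matches a direct evaluation of the DFT sum for any power-of-two length.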

  8. Graphics-processor-unit-based parallelization of optimized baseline wander filtering algorithms for long-term electrocardiography.

    Science.gov (United States)

    Niederhauser, Thomas; Wyss-Balmer, Thomas; Haeberlin, Andreas; Marisa, Thanks; Wildhaber, Reto A; Goette, Josef; Jacomet, Marcel; Vogel, Rolf

    2015-06-01

    Long-term electrocardiogram (ECG) often suffers from relevant noise. Baseline wander in particular is pronounced in ECG recordings using dry or esophageal electrodes, which are dedicated for prolonged registration. While analog high-pass filters introduce phase distortions, reliable offline filtering of the baseline wander implies a computational burden that has to be put in relation to the increase in signal-to-baseline ratio (SBR). Here, we present a graphics processor unit (GPU)-based parallelization method to speed up offline baseline wander filter algorithms, namely the wavelet, finite, and infinite impulse response, moving mean, and moving median filters. Individual filter parameters were optimized with respect to the SBR increase based on ECGs from the Physionet database superimposed on autoregressive-modeled, real baseline wander. A Monte-Carlo simulation showed that for low input SBR the moving median filter outperforms any other method but negatively affects ECG wave detection. In contrast, the infinite impulse response filter is preferred in case of high input SBR. However, the parallelized wavelet filter is processed 500 and four times faster than these two algorithms on the GPU, respectively, and offers superior baseline wander suppression in low SBR situations. Using a signal segment of 64 mega samples that is filtered as an entire unit, wavelet filtering of a seven-day high-resolution ECG is computed within less than 3 s. Taking the high filtering speed into account, the GPU wavelet filter is the most efficient method to remove baseline wander present in long-term ECGs, with which computational burden can be strongly reduced.
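
    The moving-median baseline filter evaluated above can be sketched directly with the standard library; the GPU version parallelizes the same per-sample computation, since each window median is independent of the others:

```python
# Sketch: baseline-wander removal by a centred moving median.  The running
# median estimates the slow baseline drift, which is then subtracted from the
# signal.  Window medians are independent per sample, hence GPU-friendly.

from statistics import median

def moving_median_filter(signal, window):
    """Return signal minus its centred moving median (odd window length)."""
    half = window // 2
    n = len(signal)
    baseline = [median(signal[max(0, i - half): min(n, i + half + 1)])
                for i in range(n)]
    return [s - b for s, b in zip(signal, baseline)]
```

    A pure linear drift is removed exactly in the window interior, which is why the filter is so effective at low SBR.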

  9. Extended testing of a general contextual classifier using the massively parallel processor - Preliminary results and test plans. [for thematic mapping

    Science.gov (United States)

    Tilton, J. C.

    1985-01-01

    Earlier encouraging test results of a contextual classifier that combines spatial and spectral information employing a general statistical approach are expanded. The earlier results were of limited meaning because they were produced from small (50-by-50 pixel) data sets. An implementation of the contextual classifier on NASA Goddard's Massively Parallel Processor (MPP) is presented; for the first time the MPP makes feasible the testing of the classifier on large data sets (a 12-hour test on a VAX-11/780 minicomputer now takes 5 minutes on the MPP). The MPP is a Single-Instruction, Multiple Data Stream computer, consisting of 16,384 bit serial microprocessors connected in a 128-by-128 mesh array with each element having data transfer connections with its four nearest neighbors so that the MPP is capable of billions of operations per second. Preliminary results are given (with more expected for the conference) and plans are mentioned for extended testing of the contextual classifier on Thematic Mapper data sets.

  10. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, N; Petridou, P; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors read out drivers (RODs) at 760 Gbps, the full rate of level-1 triggers. Clustering serves two purposes. The first is to reduce the high rate of the received data before further processing. The second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The implementation is fully generic, therefore the detection window size can be optimized for the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently thus exploiting more FPGA resources. This flexibility ma...

  11. A 1,000 Frames/s Programmable Vision Chip with Variable Resolution and Row-Pixel-Mixed Parallel Image Processors

    Directory of Open Access Journals (Sweden)

    Nanjian Wu

    2009-07-01

    Full Text Available A programmable vision chip with variable resolution and row-pixel-mixed parallel image processors is presented. The chip consists of a CMOS sensor array with row-parallel 6-bit algorithmic ADCs, row-parallel gray-scale image processors, a pixel-parallel SIMD Processing Element (PE) array, and an instruction controller. The resolution of the image in the chip is variable: high resolution for a focused area and low resolution for the general view. It implements gray-scale and binary mathematical morphology algorithms in series to carry out low-level and mid-level image processing and sends out features of the image for various applications. It can perform image processing at over 1,000 frames/s (fps). A prototype chip with 64 × 64 pixel resolution and 6-bit gray scale is fabricated in a 0.18 μm standard CMOS process. The chip area is 1.5 mm × 3.5 mm. Each pixel is 9.5 μm × 9.5 μm and each processing element is 23 μm × 29 μm. The experimental results demonstrate that the chip can perform low-level and mid-level image processing and can be applied to real-time vision applications, such as high-speed target tracking.
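
    The binary mathematical-morphology step the PE array executes in pixel-parallel fashion can be illustrated with a 3×3 erosion (plain reference code, not the chip's SIMD microcode):

```python
# Sketch: 3x3 binary erosion, the kind of morphology operation a pixel-parallel
# PE array computes in one step, since each output pixel depends only on its
# immediate neighbourhood.  Pixels outside the image are treated as 0.

def erode3x3(img):
    """Return the 3x3 binary erosion of a list-of-lists 0/1 image."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out
```

    Eroding an all-ones 5×5 image leaves only the 3×3 interior set, the expected one-pixel shrink.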

  12. Technology development and circuit design for a parallel laser programmable floating-point application specific processor. Master's thesis

    Energy Technology Data Exchange (ETDEWEB)

    Scriber, M.W.

    1989-12-01

    The laser programmable floating point application specific processor (LPASP) is a new approach to rapid development of custom VLSI chips. The LPASP is a generic application-specific processor that can be programmed to perform a specific function. The effort of this thesis is to develop and test the double precision floating point adder and the laser programmable read-only memory (LPROM) that are macrocells within the LPASP. In addition, the thesis analyzes the applicability of an LPASP parallel processing system. The double precision floating point adder is an adder/subtractor macrocell designed to comply with the IEEE double precision floating point standard. An 84-pin chip of the adder was fabricated using 2 micron feature sizes. The fastest processing time was measured at 120 nanoseconds over 23 worst-case test vectors. The adder uses the optimized carry multiplexed (OCM) adder that was developed at AFIT. The OCM adder is a new adder architecture that uses four parallel carry paths to attain a performance time on the order of the cube root of the operand width with a gate count on the order of O(n). The redundant logic associated with the parallel propagation banks is eliminated in the OCM adder, so that the largest bit-slice of the adder contains only eight 2-to-1 multiplexer gates. A 57-bit adder was fabricated using 2 micron feature sizes. The processing time for the adder is 31 nsec.
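
    The carry-selection idea underlying fast parallel-carry adders such as the OCM can be sketched behaviourally: each block computes its sum for both possible carry-ins in parallel, and the incoming carry then merely selects a precomputed result. The block size and width below are illustrative assumptions, not the thesis's gate-level design:

```python
# Sketch: carry-select addition.  In hardware, sum0 and sum1 for every block
# are computed simultaneously; the carry chain then only drives multiplexers,
# shortening the critical path compared with a full ripple-carry adder.

def carry_select_add(x, y, block=8, blocks=8):
    """Add two (blocks*block)-bit unsigned integers via per-block carry selection."""
    mask = (1 << block) - 1
    result, carry = 0, 0
    for i in range(blocks):
        shift = i * block
        xb = (x >> shift) & mask
        yb = (y >> shift) & mask
        sum0 = xb + yb            # block result assuming carry-in 0
        sum1 = sum0 + 1           # block result assuming carry-in 1 (parallel in HW)
        s = sum1 if carry else sum0
        result |= (s & mask) << shift
        carry = s >> block        # carry-out feeds the next block's selector
    return result | (carry << (blocks * block))
```

    The behavioural model agrees with ordinary integer addition for any operands that fit the configured width.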

  13. Memory-Level Parallelism and Processor Microarchitecture

    Institute of Scientific and Technical Information of China (English)

    谢伦国; 刘德峰

    2011-01-01

    As the gap between processor and memory performance increases, performance loss due to long-latency memory accesses becomes a primary problem. Memory-level parallelism (MLP) improves performance by servicing memory accesses concurrently. In this paper, the authors review the background of MLP, introduce the concept of MLP and its relation to processor performance models, and analyze the main factors limiting a processor's MLP. They then survey in detail the various techniques for improving a processor's MLP and, finally, summarize the open issues and outline promising research directions.

  14. Parallelization of applications for networks with homogeneous and heterogeneous processors

    Energy Technology Data Exchange (ETDEWEB)

    Colombet, L.

    1994-10-07

    The aim of this thesis is to study and develop efficient methods for the parallelization of scientific applications on parallel computers with distributed memory. The first part presents the PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) communication libraries. They allow the implementation of programs on most parallel machines, but also on heterogeneous computer networks. This chapter illustrates the problems faced when trying to evaluate the performance of networks with heterogeneous processors. To evaluate such performance, the concepts of speed-up and efficiency have been modified and adapted to account for heterogeneity. The second part deals with a study of parallel application libraries such as ScaLAPACK and with the development of communication-masking techniques. The general concept is based on communication anticipation, in particular by pipelining message-sending operations. Experimental results on Cray T3D and IBM SP1 machines validate the theoretical studies performed on the basic algorithms of the libraries discussed above. Two examples of scientific applications are given: the first is a model of young stars for astrophysics and the other is a model of photon trajectories in the Compton effect. (J.S.). 83 refs., 65 figs., 24 tabs.

  15. A Highly Parallel FPGA Implementation of a 2D-Clustering Algorithm for the ATLAS Fast TracKer (FTK) Processor

    CERN Document Server

    Kimura, N; The ATLAS collaboration; Beretta, M; Gatta, M; Gkaitatzis, S; Iizawa, T; Kordas, K; Korikawa, T; Nikolaidis, S; Petridou, C; Sotiropoulou, C-L; Yorita, K; Volpi, G

    2014-01-01

    The highly parallel 2D-clustering FPGA implementation used for the input system of the Fast TracKer (FTK) processor for the ATLAS experiment at the Large Hadron Collider (LHC) at CERN is presented. After the 2013-2014 shutdown period the LHC is expected to increase its luminosity, which will make efficient online selection of rare events more difficult due to the increasing number of overlapping collisions. FTK is a highly parallelized hardware system that improves online selection by real-time track finding using silicon inner detector information. The FTK system requires fast and robust clustering of hit positions from the silicon detector on FPGAs. We show the development of the original input boards and the implemented clustering algorithm. For the complicated 2D clustering, a moving window technique is used to economize the limited FPGA resources. The developed boards and the implementation of the clustering algorithm have sufficient processing power to meet the specification for the silicon inner detector of ATLAS for the maximum LH...

  16. Calculation of focal positions in an optical head for parallel data processing with a monolithic four-beam laser diode.

    Science.gov (United States)

    Shinoda, M

    2001-03-01

    A method for calculating focal positions in a multibeam optical head by use of a multibeam laser diode, in which conditions for misalignment of the light source are taken into consideration, is introduced. One calculates the focal positions by using the practical characteristics of a monolithic four-beam laser diode and the practical specifications of the optics in an optical head. The results show that each focal position is defocused mainly as a result of curvature of the fields of the lenses. The adaptability of focal positions for various calculated conditions is discussed from the standpoint of depth of focus.

  17. An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications.

    Science.gov (United States)

    Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun

    2015-12-01

    The Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition, and action recognition because of its best-in-class recognition accuracy compared to hand-crafted and shallow-learning-based algorithms. The long learning time caused by its complex structure, however, has so far limited its usage to high-cost servers or many-core GPU platforms. On the other hand, the demand for customized pattern recognition within personal devices will grow gradually as more deep learning applications are developed. This paper presents an SoC implementation that enables deep learning applications to run on low-cost platforms such as mobile or portable devices. Different from conventional works, which have adopted massively parallel architectures, this work adopts a task-flexible architecture and exploits multiple forms of parallelism to cover the complex functions of the convolutional deep belief network, one of the popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable systems. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power, at a 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.

  18. Real-time Parallel Processing System Design and Implementation for Underwater Acoustic Communication Based on Multiple Processors

    Institute of Scientific and Technical Information of China (English)

    YAN Zhen-hua; HUANG Jian-guo; ZHANG Qun-fei; HE Cheng-bing

    2007-01-01

    The ADSP-TS101 is a high-performance DSP with good parallel processing capability and high speed. According to the real-time processing requirements of underwater acoustic communication algorithms, a real-time parallel processing system with multi-channel synchronous sampling, composed of multiple ADSP-TS101s, is designed and implemented. For the hardware design, field programmable gate array (FPGA) logic control is adopted for the multi-channel synchronous sampling module, and a cluster/data-flow associated pin connection mode is adopted for the multiprocessor parallel processing configuration. The software is optimized through two kinds of communication: broadcast writes over the shared bus and point-to-point transfers through the link ports. Through whole-system installation, connective debugging, and experiments in a lake, the results show that the real-time parallel processing system has good stability and real-time processing capability and meets the technical design requirements for real-time processing.

  19. General purpose parallel programing using new generation graphic processors: CPU vs GPU comparative analysis and opportunities research

    Directory of Open Access Journals (Sweden)

    Donatas Krušna

    2013-03-01

    Full Text Available OpenCL, a modern parallel heterogeneous-system programming language, enables problems to be partitioned and executed on modern CPU and GPU hardware, which increases the performance of such applications considerably. Since GPUs are optimized for, and specialize in, floating-point and vector operations, they greatly outperform general-purpose CPUs in this field. The language greatly simplifies the creation of applications for such heterogeneous systems, since it is cross-platform and vendor independent, and it can be embedded in any other general-purpose programming language via libraries. More and more tools are being developed that are aimed at low-level programmers, scientists, and engineers alike who are developing applications or libraries for today's CPUs and GPUs as well as other heterogeneous platforms. The tendency today is to increase the number of cores or CPUs in the hope of increasing performance; however, the increasing difficulty of parallelizing applications for such systems and the ever increasing overhead of communication and synchronization limit the potential performance. This means that there is a point at which adding cores or CPUs no longer increases an application's performance and can even diminish it. Even though parallel programming and GPUs with stream computing capabilities have decreased the need for communication and synchronization (since only the final result needs to be committed to memory), this still remains a weak link in developing such applications.

  20. Parallel Computation on Multicore Processors Using Explicit Form of the Finite Element Method and C++ Standard Libraries

    Directory of Open Access Journals (Sweden)

    Rek Václav

    2016-11-01

    Full Text Available In this paper, the modifications of existing sequential code written in the C or C++ programming language for the calculation of various kinds of structures using the explicit form of the Finite Element Method (Dynamic Relaxation Method, Explicit Dynamics) in the NEXX system are introduced. The NEXX system is the core of the engineering software NEXIS, Scia Engineer, RFEM and RENEX. It supports multithreaded execution, which can now be implemented at the level of the native C++ programming language using the standard libraries. Thanks to the high degree of abstraction that the contemporary C++ programming language provides, a library created in this way can be generalized for other uses of parallelism in computational mechanics.

  1. A Systolic Array RLS Processor

    OpenAIRE

    Asai, T.; Matsumoto, T.

    2000-01-01

    This paper presents the outline of a systolic array recursive least-squares (RLS) processor prototyped primarily for broadband mobile communication applications. To execute the RLS algorithm effectively, this processor uses an orthogonal triangularization technique known in matrix algebra as QR decomposition for parallel pipelined processing. The processor board comprises 19 application-specific integrated circuit chips, each with approximately one million gates. Thirty-two bit ...
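
    The QR-decomposition step at the heart of a systolic RLS array can be sketched as a Givens-rotation update: a boundary cell computes the rotation that annihilates the leading element of each incoming data row, and the internal cells apply that rotation across the row. A plain-Python illustration, not the ASIC design:

```python
# Sketch: annihilating a new data row into an upper-triangular matrix R by
# Givens rotations -- the core operation of a QRD-RLS systolic array.  The
# invariant R^T R = A^T A (over all rows seen so far) is what lets R replace
# the full data matrix in the least-squares solution.

import math

def givens_update(R, row):
    """Rotate a new row into the upper-triangular R (list of lists), in place."""
    n = len(R)
    row = list(row)
    for i in range(n):
        a, b = R[i][i], row[i]
        r = math.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r                 # rotation computed by a boundary cell
        for j in range(i, n):               # applied across the row by internal cells
            Rij, vj = R[i][j], row[j]
            R[i][j] = c * Rij + s * vj
            row[j] = -s * Rij + c * vj      # leading element is zeroed at j == i
    return R
```

    Feeding in the rows of a matrix A one at a time leaves an R with RᵀR = AᵀA, which is easy to check on a small example.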

  2. Monolithic spectrometer

    Science.gov (United States)

    Rajic, Slobodan; Egert, Charles M.; Kahl, William K.; Snyder, Jr., William B.; Evans, III, Boyd M.; Marlar, Troy A.; Cunningham, Joseph P.

    1998-01-01

    A monolithic spectrometer is disclosed for use in spectroscopy. The spectrometer is a single body of translucent material with positioned surfaces for the transmission, reflection and spectral analysis of light rays.

  3. Deterministic Replay for Parallel Programs in Multi-Core Processors

    Institute of Scientific and Technical Information of China (English)

    高岚; 王锐; 钱德沛

    2013-01-01

      The deterministic replay of parallel programs on multi-core processors is an effective means of debugging parallel programs and is important for parallel programming. However, because of unsynchronized accesses to shared memory under multi-core architectures, research on deterministic replay still faces challenges on several fronts, and an industrial-strength deterministic replay for parallel programs has not yet emerged; this makes debugging difficult and hinders the adoption and development of parallel programming on multi-core architectures. This paper analyzes the non-deterministic events that make deterministic replay hard to achieve on multi-core processors, summarizes the metrics for evaluating deterministic replay schemes, and surveys recent research on deterministic replay for parallel programs. Based on these metrics, current schemes are analyzed and compared in terms of software-only and hardware-assisted approaches, and on this basis the future research trends and application prospects of deterministic replay for parallel programs on multi-core architectures are discussed.

  4. Hardware multiplier processor

    Science.gov (United States)

    Pierce, Paul E.

    1986-01-01

    A hardware processor is disclosed which in the described embodiment is a memory-mapped multiplier processor that can operate in parallel with a 16-bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions, so that in one access it can write and automatically perform single or double precision multiplication of a number written to it, with or without addition or subtraction of a previously stored number. It can also, on a single read command, automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16-bit multiplier registers, two concatenated 16-bit multipliers, and four 16-bit product registers connected to an internal 16-bit data bus. A high-level address decoder determines when the multiplier processor is being addressed, and first and second low-level address decoders generate control signals. In addition, certain low-order address lines carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and a plurality of clocking pulse trains in response to the decoded and address control signals.

  5. Tiled Multicore Processors

    Science.gov (United States)

    Taylor, Michael B.; Lee, Walter; Miller, Jason E.; Wentzlaff, David; Bratt, Ian; Greenwald, Ben; Hoffmann, Henry; Johnson, Paul R.; Kim, Jason S.; Psota, James; Saraf, Arvind; Shnidman, Nathan; Strumpen, Volker; Frank, Matthew I.; Amarasinghe, Saman; Agarwal, Anant

    For the last few decades Moore’s Law has continually provided exponential growth in the number of transistors on a single chip. This chapter describes a class of architectures, called tiled multicore architectures, that are designed to exploit massive quantities of on-chip resources in an efficient, scalable manner. Tiled multicore architectures combine each processor core with a switch to create a modular element called a tile. Tiles are replicated on a chip as needed to create multicores with any number of tiles. The Raw processor, a pioneering example of a tiled multicore processor, is examined in detail to explain the philosophy, design, and strengths of such architectures. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance. Central to achieving this goal is Raw’s ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Compared to a traditional superscalar processor, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x-9x better for higher levels of ILP, and 10x-100x better when highly parallel applications are coded in a stream language or optimized by hand.

  6. Parallel architecture for rapid image generation and analysis

    Energy Technology Data Exchange (ETDEWEB)

    Nerheim, R.J.

    1987-01-01

    A multiprocessor architecture inspired by the Disney multiplane camera is proposed. For many applications, this approach produces a natural mapping of processors to objects in a scene. Such a mapping promotes parallelism and reduces the hidden-surface work with minimal interprocessor communication and low-overhead cost. Existing graphics architectures store the final picture as a monolithic entity. The architecture here stores each object's image separately. It assembles the final composite picture from component images only when the video display needs to be refreshed. This organization simplifies the work required to animate moving objects that occlude other objects. In addition, the architecture has multiple processors that generate the component images in parallel. This further shortens the time needed to create a composite picture. In addition to generating images for animation, the architecture has the ability to decompose images.

  7. Pressure drop in CIM disk monolithic columns.

    Science.gov (United States)

    Mihelic, Igor; Nemec, Damjan; Podgornik, Ales; Koloini, Tine

    2005-02-11

    Pressure drop analysis in commercial CIM disk monolithic columns is presented. Experimental measurements of pressure drop are compared to the hydrodynamic models usually employed for predicting pressure drop in packed beds, e.g. the free surface model and the capillary model applying the hydraulic radius concept. However, the comparison between the pressure drop in the monolith and in an equivalent packed bed gives unexpected results. The pressure drop in a CIM disk monolithic column is approximately 50% lower than in an equivalent packed bed of spheres having the same hydraulic radius as the CIM disk monolith, meaning that both have the same porosity and the same specific surface area. This phenomenon seems to be a consequence of the monolithic porous structure, which is quite different in terms of pore size distribution and parallel pore nonuniformity from that of conventional packed beds. The number of self-similar levels for the CIM monoliths was estimated to be between 1.03 and 2.75.
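For reference, the packed-bed baseline the monolith is compared against can be sketched with the standard Kozeny-Carman form of the capillary model and the hydraulic radius of a sphere bed (illustrative textbook formulas only; the roughly 50% lower monolith pressure drop is the paper's empirical finding and does not follow from this model):

```python
def kozeny_carman_dp(mu, u, eps, d_p, length):
    """Kozeny-Carman pressure drop [Pa] across a packed bed of spheres:
    mu = fluid viscosity [Pa s], u = superficial velocity [m/s],
    eps = bed porosity, d_p = particle diameter [m], length = bed depth [m]."""
    return 180.0 * mu * u * (1.0 - eps) ** 2 / (eps ** 3 * d_p ** 2) * length

def hydraulic_radius_spheres(eps, d_p):
    """Hydraulic radius of a sphere bed: porosity over specific surface,
    which for uniform spheres reduces to eps * d_p / (6 * (1 - eps))."""
    return eps * d_p / (6.0 * (1.0 - eps))
```

Matching the monolith's porosity and specific surface area fixes the hydraulic radius, so under the capillary model both media would be predicted to show similar drops; the observed difference is attributed in the paper to the monolith's pore structure.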

  8. Introduction to parallel programming

    CERN Document Server

    Brawer, Steven

    1989-01-01

    Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race

  9. Parallel grid population

    Science.gov (United States)

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
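The two-phase scheme described in the abstract can be sketched as follows (a hypothetical 1-D analogue with interval objects, and a thread pool standing in for the n processors; all names and parameters are illustrative, not from the patent):

```python
from concurrent.futures import ThreadPoolExecutor

def populate_grid(objects, n, extent):
    """Two-phase parallel grid population sketch over a 1-D grid [0, extent).
    Phase 1: each worker bins its distinct share of objects by the grid
    portion(s) that at least partially bound them.
    Phase 2: each worker populates its own distinct grid portion."""
    cell = extent / n
    shares = [objects[i::n] for i in range(n)]       # n distinct object sets

    def classify(share):
        bins = [[] for _ in range(n)]
        for lo, hi in share:                         # object = (lo, hi) interval
            first = max(0, min(n - 1, int(lo // cell)))
            last = max(0, min(n - 1, int(hi // cell)))
            for p in range(first, last + 1):         # may span several portions
                bins[p].append((lo, hi))
        return bins

    with ThreadPoolExecutor(max_workers=n) as pool:
        per_worker = list(pool.map(classify, shares))            # phase 1
        grid = list(pool.map(                                     # phase 2
            lambda p: [o for bins in per_worker for o in bins[p]],
            range(n)))
    return grid
```

Note that an object overlapping a portion boundary is recorded in every portion that bounds it, mirroring the "at least partially bounded" wording of the method.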

  10. Invasive tightly coupled processor arrays

    CERN Document Server

    LARI, VAHID

    2016-01-01

    This book introduces a new massively parallel computer (MPSoC) architecture called invasive tightly coupled processor arrays (TCPAs). It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture, consisting of locally interconnected VLIW processing elements, can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding the global memory accesses of GPUs, and may even support loops with complex dependencies, such as loop-carried dependencies, that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...

  11. Monoliths in Bioprocess Technology

    Directory of Open Access Journals (Sweden)

    Vignesh Rajamanickam

    2015-04-01

    Full Text Available Monolithic columns are a special type of chromatography column, which can be used for the purification of different biomolecules. They have become popular due to their high mass transfer properties and short purification times. Several articles have already discussed monolith manufacturing, as well as monolith characteristics. In contrast, this review focuses on the applied aspect of monoliths and discusses the most relevant biomolecules that can be successfully purified by them. We describe success stories for viruses, nucleic acids and proteins and compare them to conventional purification methods. Furthermore, the advantages of monolithic columns over particle-based resins, as well as the limitations of monoliths are discussed. With a compilation of commercially available monolithic columns, this review aims at serving as a ‘yellow pages’ for bioprocess engineers who face the challenge of purifying a certain biomolecule using monoliths.

  12. Spaceborne Processor Array

    Science.gov (United States)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  13. Neurovision processor for designing intelligent sensors

    Science.gov (United States)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  14. Concept of a Supervector Processor: A Vector Approach to Superscalar Processor, Design and Performance Analysis

    Directory of Open Access Journals (Sweden)

    Deepak Kumar, Ranjan Kumar Behera, K. S. Pandey

    2013-07-01

    Full Text Available Maximizing the available performance is always a goal in microprocessor design. In this paper a new technique has been implemented which exploits the advantages of both the superscalar and the vector processing technique in a proposed processor called the supervector processor. A vector processor operates on arrays of data called vectors and can greatly improve certain tasks, such as numerical simulation and tasks which require heavy number crunching. On the other hand, a superscalar processor issues multiple instructions per cycle, which can enhance throughput. To implement parallelism, multiple vector instructions were issued and executed per cycle in superscalar fashion. A case study has been done on various benchmarks to compare the performance of the proposed supervector processor architecture with superscalar and vector processor architectures. The Trimaran framework has been used to evaluate the performance of the proposed supervector processor scheme.

  15. Multi-core processors - An overview

    CERN Document Server

    Venu, Balaji

    2011-01-01

    Microprocessors have revolutionized the world we live in, and continuous efforts are being made to manufacture not only faster chips but also smarter ones. A number of techniques such as data level parallelism, instruction level parallelism and hyper-threading (Intel's HT) already exist and have dramatically improved the performance of microprocessor cores. This paper briefs on the evolution of multi-core processors, followed by an introduction to the technology and its advantages in today's world. The paper concludes by detailing the challenges currently faced by multi-core processors and how the industry is trying to address these issues.

  16. A monolithic integrated photonic microwave filter

    Science.gov (United States)

    Fandiño, Javier S.; Muñoz, Pascual; Doménech, David; Capmany, José

    2016-12-01

    Meeting the increasing demand for capacity in wireless networks requires the harnessing of higher regions in the radiofrequency spectrum, reducing cell size, as well as more compact, agile and power-efficient base stations that are capable of smoothly interfacing the radio and fibre segments. Fully functional microwave photonic chips are promising candidates in attempts to meet these goals. In recent years, many integrated microwave photonic chips have been reported in different technologies. To the best of our knowledge, none has monolithically integrated all the main active and passive optoelectronic components. Here, we report the first demonstration of a tunable microwave photonics filter that is monolithically integrated into an indium phosphide chip. The reconfigurable radiofrequency photonic filter includes all the necessary elements (for example, lasers, modulators and photodetectors), and its response can be tuned by means of control electric currents. This is an important step in demonstrating the feasibility of integrated and programmable microwave photonic processors.

  17. Joint Experimentation on Scalable Parallel Processors (JESPP)

    Science.gov (United States)

    2006-04-01

    on the order of 50 Mb/s to the JESPP. Our personnel assisted J9 in designing, selecting, installing and initializing a “Beowulf” Linux cluster at J9...CPUs that are connected by a high speed network. One of the major types of systems run by the HPC centers is the Beowulf cluster. (Sterling et al., 1995) Beowulf clusters have become popular systems in the world of HPC systems because of their low cost for the amount of power they provide. A

  18. Acoustooptic linear algebra processors - Architectures, algorithms, and applications

    Science.gov (United States)

    Casasent, D.

    1984-01-01

    Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

  20. Parallel processing ITS

    Energy Technology Data Exchange (ETDEWEB)

    Fan, W.C.; Halbleib, J.A. Sr.

    1996-09-01

    This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.
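The master/slave model with message passing can be sketched as follows (a toy stand-in using Python threads and queues in place of ITS's actual message-passing layer; the batch tally is a placeholder for real particle transport, and all names are illustrative):

```python
import queue
import threading

def run_master_slave(n_histories, n_slaves, batch=10):
    """Master/slave sketch: the master enqueues batches of Monte Carlo
    histories; each slave repeatedly takes a batch, processes it, and
    reports a partial result that the master merges at the end."""
    work, results = queue.Queue(), queue.Queue()
    for start in range(0, n_histories, batch):       # master: distribute work
        work.put(min(batch, n_histories - start))

    def slave():
        tally = 0
        while True:
            try:
                n = work.get_nowait()
            except queue.Empty:                      # no work left: report
                break
            tally += n          # placeholder for transporting n histories
        results.put(tally)

    threads = [threading.Thread(target=slave) for _ in range(n_slaves)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = 0                                        # master: merge results
    while not results.empty():
        total += results.get()
    return total
```

Distributing work in batches rather than fixed halves is one simple form of the load balancing the report discusses: faster slaves naturally take more batches.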

  1. Bibliographic Pattern Matching Using the ICL Distributed Array Processor.

    Science.gov (United States)

    Carroll, David M.; And Others

    1988-01-01

    Describes the use of a highly parallel array processor for pattern matching operations in a bibliographic retrieval system. The discussion covers the hardware and software features of the processor, the pattern matching algorithm used, and the results of experimental tests of the system. (37 references) (Author/CLB)

  2. Embedded Processor Laboratory

    Data.gov (United States)

    Federal Laboratory Consortium — The Embedded Processor Laboratory provides the means to design, develop, fabricate, and test embedded computers for missile guidance electronics systems in support...

  3. Quantum Algorithm Processor For Finding Exact Divisors

    OpenAIRE

    Burger, John Robert

    2005-01-01

    Wiring diagrams are given for a quantum algorithm processor in CMOS to compute, in parallel, all divisors of an n-bit integer. Lines required in a wiring diagram are proportional to n. Execution time is proportional to the square of n.

  4. An Experimental Digital Image Processor

    Science.gov (United States)

    Cok, Ronald S.

    1986-12-01

    A prototype digital image processor for enhancing photographic images has been built in the Research Laboratories at Kodak. This image processor implements a particular version of each of the following algorithms: photographic grain and noise removal, edge sharpening, multidimensional image-segmentation, image-tone reproduction adjustment, and image-color saturation adjustment. All processing, except for segmentation and analysis, is performed by massively parallel and pipelined special-purpose hardware. This hardware runs at 10 MHz and can be adjusted to handle any size digital image. The segmentation circuits run at 30 MHz. The segmentation data are used by three single-board computers for calculating the tonescale adjustment curves. The system, as a whole, has the capability of completely processing 10 million three-color pixels per second. The grain removal and edge enhancement algorithms represent the largest part of the pipelined hardware, operating at over 8 billion integer operations per second. The edge enhancement is performed by unsharp masking, and the grain removal is done using a collapsed Walsh-hadamard transform filtering technique (U.S. Patent No. 4549212). These two algorithms can be realized using four basic processing elements, some of which have been implemented as VLSI semicustom integrated circuits. These circuits implement the algorithms with a high degree of efficiency, modularity, and testability. The digital processor is controlled by a Digital Equipment Corporation (DEC) PDP 11 minicomputer and can be interfaced to electronic printing and/or electronic scanning devices. The processor has been used to process over a thousand diagnostic images.
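Unsharp masking, the edge-enhancement method named in the abstract, can be sketched in one dimension (a generic illustration of the technique only; the Kodak hardware's collapsed Walsh-Hadamard grain-removal filter is not reproduced here):

```python
def unsharp_mask(signal, radius=1, amount=1.0):
    """Unsharp masking sketch: sharpen a 1-D signal by adding back a
    scaled difference between the signal and a box-blurred copy of it."""
    n = len(signal)
    blurred = []
    for i in range(n):                       # simple box blur (the "mask")
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        blurred.append(sum(signal[lo:hi]) / (hi - lo))
    # add the high-frequency residue back to steepen edges
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]
```

Flat regions pass through unchanged while edges are overshot on both sides, which is what produces the perceived sharpening.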

  5. Array processors in chemistry

    Energy Technology Data Exchange (ETDEWEB)

    Ostlund, N.S.

    1980-01-01

    The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

  6. Broadcasting collective operation contributions throughout a parallel computer

    Science.gov (United States)

    Faraj, Ahmad [Rochester, MN

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
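The two phases described, intra-node all-gather followed by a serial processor transmission sequence for inter-node broadcast, can be sketched as follows (a sequential simulation; function and variable names are illustrative, not from the patent):

```python
def broadcast_contributions(nodes):
    """Sketch of the two-phase broadcast. `nodes` is a list of compute
    nodes, each a list holding one contribution per processor.
    Phase 1: intra-node all-gather, so every processor on a node holds
    that node's contributions.
    Phase 2: following a serial processor transmission sequence, each
    processor slot in turn transmits inter-node to all other nodes."""
    n_nodes, procs = len(nodes), len(nodes[0])
    gathered = [list(node) for node in nodes]        # phase 1: intra-node
    for slot in range(procs):                        # phase 2: serial sequence
        for src in range(n_nodes):
            for dst in range(n_nodes):
                if dst != src:                       # inter-node transmission
                    gathered[dst].append(nodes[src][slot])
    return gathered
```

Serializing the inter-node phase per processor slot models the "designated network link" constraint: only one processor per node transmits at a time.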

  7. C-slow Technique vs Multiprocessor in designing Low Area Customized Instruction set Processor for Embedded Applications

    CERN Document Server

    Akram, Muhammad Adeel; Sarfaraz, Muhammad Masood

    2012-01-01

    The demand for high performance embedded processors for consumer electronics has been rapidly increasing for the past few years. Many of these embedded processors depend upon a custom-built Instruction Set Architecture (ISA), such as game processors (GPU), multimedia processors, DSP processors, etc. The primary requirement for the consumer electronics industry is low cost with high performance and low power consumption. A lot of research has evolved to enhance the performance of embedded processors through parallel computing. But some of it focuses on superscalar processors, i.e. single processors with more resources for Instruction Level Parallelism (ILP), which includes Very Long Instruction Word (VLIW) architectures and custom instruction set extensible processor architectures, while other work requires more processing units on a single chip for Thread Level Parallelism (TLP), which includes Simultaneous Multithreading (SMT), Chip Multithreading (CMT) and Chip Multiprocessing (CMP). In this paper, we present a new technique, n...

  8. SPROC: A multiple-processor DSP IC

    Science.gov (United States)

    Davis, R.

    1991-01-01

    A large, single-chip, multiple-processor, digital signal processing (DSP) integrated circuit (IC) fabricated in HP-Cmos34 is presented. The innovative architecture is best suited for analog and real-time systems characterized by both parallel signal data flows and concurrent logic processing. The IC is supported by a powerful development system that transforms graphical signal flow graphs into production-ready systems in minutes. Automatic compiler partitioning of tasks among four on-chip processors gives the IC the signal processing power of several conventional DSP chips.

  9. Development of a Massively Parallel NOGAPS Forecast Model

    Science.gov (United States)

    2016-06-07

    parallel computer architectures. These algorithms will be critical for inter-processor communication dependent and computationally intensive model...to exploit massively parallel processor (MPP), distributed memory computer architectures. Future increases in computer power from MPPs will allow...passing (MPI) is the paradigm chosen for communication between distributed memory processors. APPROACH Use integrations of the current operational

  10. The Parallel C Preprocessor

    Directory of Open Access Journals (Sweden)

    Eugene D. Brooks III

    1992-01-01

    Full Text Available We describe a parallel extension of the C programming language designed for multiprocessors that provide a facility for sharing memory between processors. The programming model was initially developed on conventional shared memory machines with small processor counts such as the Sequent Balance and Alliant FX/8, but has more recently been used on a scalable massively parallel machine, the BBN TC2000. The programming model is split-join rather than fork-join. Concurrency is exploited to use a fixed number of processors more efficiently rather than to exploit more processors as in the fork-join model. Team splitting, a mechanism to split the team of processors executing a code into subteams to handle parallel subtasks, is used to provide an efficient mechanism to exploit nested concurrency. We have found the split-join programming model to have an inherent implementation advantage, compared to the fork-join model, when the number of processors in a machine becomes large.
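Team splitting, the mechanism described above, can be sketched as follows (an illustrative partitioning function only, not the Parallel C Preprocessor's actual runtime; in the split-join model the same fixed set of processors is divided among subtasks rather than forking new ones):

```python
def split_team(team, n_subtasks):
    """Divide a fixed team of processor ids into n_subtasks subteams,
    as evenly as possible. Each subteam then executes one parallel
    subtask; after the subtasks finish, the subteams rejoin."""
    base, extra = divmod(len(team), n_subtasks)
    subteams, start = [], 0
    for i in range(n_subtasks):
        size = base + (1 if i < extra else 0)   # spread any remainder
        subteams.append(team[start:start + size])
        start += size
    return subteams
```

Because the subteams partition the original team exactly, nested concurrency reuses the same processors instead of oversubscribing the machine, which is the implementation advantage the abstract claims for split-join over fork-join.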

  11. Introduction to Parallel Computing

    Science.gov (United States)

    1992-05-01

    Languages: C, Ada, C++, Data-parallel FORTRAN, FORTRAN-90 (late 1992); topology: 2D mesh of node boards, each board with one application processor; development tools. ...As parallel machines become the wave of the present, tools are increasingly needed to assist programmers in creating parallel tasks and coordinating their activities. Linda was designed to be such a tool. Linda was designed with three important goals in mind: to be portable, efficient, and easy to use

  12. Parallel Programming with Intel Parallel Studio XE

    CERN Document Server

    Blair-Chappell , Stephen

    2012-01-01

    Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the

  13. Selection and integration of a network of parallel processors in the real time acquisition system of the 4{pi} DIAMANT multidetector: modeling, realization and evaluation of the software installed on this network; Choix et integration d'un reseau de processeurs paralleles dans le systeme d'acquisition temps reel du multidetecteur 4{pi} DIAMANT: modelisation, realisation et evaluation du logiciel implante sur ce reseau

    Energy Technology Data Exchange (ETDEWEB)

    Guirande, F. [Ecole Doctorale des Sciences Physiques et de l`Ingenieur, Bordeaux-1 Univ., 33 (France)

    1997-07-11

    The increase in sensitivity of 4{pi} arrays such as EUROBALL or DIAMANT has led to an increase in the data flow rate into the data acquisition system. While at the electronic level the data flow has been distributed onto several data acquisition buses, it is necessary to increase the processing power in the data processing system. This work concerns the modelling and implementation of the software allocated onto an architecture of parallel processors. Object analysis and formal methods were used; benchmarks and the future evolution of this architecture are presented. The thesis consists of two parts. Part A, devoted to 'Nuclear Spectroscopy with 4{pi} multidetectors', contains a first chapter entitled 'The Physics of 4{pi} multidetectors' and a second chapter entitled 'Integral architecture of 4{pi} multidetectors'. Part B, devoted to 'Parallel acquisition system of DIAMANT', contains three chapters entitled 'Material architecture', 'Software architecture' and 'Validation and Performances'. Four appendices and a term glossary close this work. (author) 58 refs.

  14. Dynamic Load Balancing using Graphics Processors

    Directory of Open Access Journals (Sweden)

    R Mohan

    2014-04-01

    Full Text Available To get maximum performance on many-core graphics processors, it is important to have an even balance of the workload so that all processing units contribute equally to the task at hand. This can be hard to achieve when the cost of a task is not known beforehand and when new sub-tasks are created dynamically during execution. Two dynamic load balancing methods, static task assignment and work stealing using deques, are compared to see which is better suited to the highly parallel world of graphics processors. They were evaluated on the task of computing a computer move in response to a human move in the famous four-in-a-row game. The experiments showed that synchronization can be very expensive, and that new methods which use graphics processor features wisely might be required.
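
    The two strategies compared in this record can be illustrated with ordinary CPU threads. Below is a minimal sketch (not the paper's GPU implementation; all names are illustrative) in which tasks are first assigned statically and idle workers then rebalance by stealing from the opposite end of another worker's deque:

```python
import threading
from collections import deque

class WorkStealingPool:
    """Each worker owns a deque; idle workers steal from the opposite end."""
    def __init__(self, num_workers, tasks):
        self.deques = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self.results = []
        self.results_lock = threading.Lock()
        # Static initial assignment; stealing rebalances at run time.
        for i, task in enumerate(tasks):
            self.deques[i % num_workers].append(task)

    def _pop_or_steal(self, wid):
        # Owner takes from the back of its own deque...
        with self.locks[wid]:
            if self.deques[wid]:
                return self.deques[wid].pop()
        # ...and steals from the front of a victim's deque.
        for victim in range(len(self.deques)):
            if victim == wid:
                continue
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None

    def _worker(self, wid, func):
        while True:
            task = self._pop_or_steal(wid)
            if task is None:
                return
            value = func(task)
            with self.results_lock:
                self.results.append(value)

    def run(self, func):
        threads = [threading.Thread(target=self._worker, args=(i, func))
                   for i in range(len(self.deques))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results
```

    Here the initial round-robin assignment plays the role of static task assignment, while `_pop_or_steal` adds the work-stealing behaviour on top of it.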

  15. Programmable DNA-mediated multitasking processor

    CERN Document Server

    Shu, Jian-Jun; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-01-01

    Because of DNA's appealing features as a material, including minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNA as an information storage and processing substrate to tackle computational problems. The massive parallelism of DNA hybridization offers great potential to improve multitasking capabilities and yield a tremendous speed-up over conventional electronic processors with their stepwise signal cascades. As an example of this multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over existing silicon-mediated methods, such as performing massive data storage and simultaneous processing with much less material than conventional silicon devices.

  16. A parallel buffer tree

    DEFF Research Database (Denmark)

    Sitchinava, Nodar; Zeh, Norbert

    2012-01-01

    We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by a factor of the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries…

  17. Benchmarking a DSP processor

    OpenAIRE

    Lennartsson, Per; Nordlander, Lars

    2002-01-01

    This Master thesis describes the benchmarking of a DSP processor. Benchmarking means measuring the performance in some way. In this report, we have focused on the number of instruction cycles needed to execute certain algorithms. The algorithms we have used in the benchmark are all very common in signal processing today. The results we have reached in this thesis have been compared to benchmarks for other processors, performed by Berkeley Design Technology, Inc. The algorithms were programm...

  18. A hybrid two-stage flexible flowshop scheduling problem with m identical parallel machines and a burn-in processor separately%组成于平行机和批处理机的二阶段混合流水作业问题

    Institute of Scientific and Technical Information of China (English)

    何龙敏; 孙世杰

    2007-01-01

    A hybrid two-stage flowshop scheduling problem was considered which involves m identical parallel machines at Stage 1 and a burn-in processor M at Stage 2, with the makespan taken as the minimization objective. This scheduling problem is NP-hard in general. We divide it into eight subcases. Except for the following two subcases: (1) b ≥ a_n, max{m, B} < n; (2) a_1 ≤ b ≤ a_n, m ≤ B < n, for all other subcases their NP-hardness was proved or pointed out, corresponding approximation algorithms were constructed, and their worst-case performances were estimated. In these approximation algorithms, the Multifit and PTAS algorithms, respectively, were used to schedule the jobs on the m identical parallel machines.
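
    The Multifit algorithm mentioned above, used for the m identical parallel machines at Stage 1, can be sketched as a binary search over a common machine capacity combined with first-fit-decreasing packing (a simplified illustration, not the paper's exact construction):

```python
def ffd_fits(jobs, m, capacity):
    """First-fit-decreasing: try to pack jobs onto m machines of given capacity."""
    loads = [0.0] * m
    for job in sorted(jobs, reverse=True):
        for i in range(m):
            if loads[i] + job <= capacity:
                loads[i] += job
                break
        else:
            return False   # job fits on no machine at this capacity
    return True

def multifit(jobs, m, iterations=20):
    """Binary-search the machine capacity; returns an achievable makespan bound."""
    # The makespan can never be below the largest job or the average load.
    lo = max(max(jobs), sum(jobs) / m)
    hi = max(max(jobs), 2.0 * sum(jobs) / m)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if ffd_fits(jobs, m, mid):
            hi = mid    # feasible: try a tighter capacity
        else:
            lo = mid    # infeasible: relax the capacity
    return hi
```

    For example, five jobs of sizes 3, 3, 2, 2, 2 on two machines converge to a makespan bound of 6, which is optimal here.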

  19. High-Speed General Purpose Genetic Algorithm Processor.

    Science.gov (United States)

    Hoseini Alinodehi, Seyed Pourya; Moshfe, Sajjad; Saber Zaeimian, Masoumeh; Khoei, Abdollah; Hadidi, Khairollah

    2016-07-01

    In this paper, an ultrafast steady-state genetic algorithm processor (GAP) is presented. Because of the heavy computational load of genetic algorithms (GAs), they usually take a long time to find optimum solutions. Hardware implementation is a significant approach to overcoming this problem by speeding up the GA procedure. Hence, we designed a digital CMOS implementation of a GA in a [Formula: see text] process. The proposed processor is not bound to a specific application. Indeed, it is a general-purpose processor capable of performing optimization in any possible application. Utilizing speed-boosting techniques, such as a pipeline scheme, parallel coarse-grained processing, parallel fitness computation, parallel selection of parents, a dual-population scheme, and support for pipelined fitness computation, the proposed processor significantly reduces the processing time. Furthermore, by relying on a built-in discard operator, the proposed hardware may be used in constrained problems that are very common in control applications. In the proposed design, a large search space is achievable through bit string length extension of the individuals in the genetic population by connecting 32-bit GAPs. In addition, the proposed processor supports parallel processing, in which the GA procedure can be run on several connected processors simultaneously.
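
    The steady-state scheme that the processor implements in hardware can be sketched in software. The following is a hedged illustration (one child bred per step, replacing the worst individual; the replacement test loosely mirrors a discard operator, while the hardware's pipelining and dual-population features are omitted):

```python
import random

def steady_state_ga(fitness, n_bits=16, pop_size=20, steps=500, seed=1):
    """Steady-state GA: each step breeds one child and replaces the worst individual."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]

    def tournament():
        # Binary tournament selection of a parent.
        a, b = rng.sample(range(pop_size), 2)
        return pop[a] if fitness(pop[a]) >= fitness(pop[b]) else pop[b]

    for _ in range(steps):
        p1, p2 = tournament(), tournament()
        cut = rng.randrange(1, n_bits)            # one-point crossover
        child = p1[:cut] + p2[cut:]
        child[rng.randrange(n_bits)] ^= 1         # point mutation
        worst = min(range(pop_size), key=lambda k: fitness(pop[k]))
        if fitness(child) > fitness(pop[worst]):  # discard-style replacement
            pop[worst] = child
    return max(pop, key=fitness)
```

    On the onemax problem (fitness = number of ones), the loop quickly drives the best individual towards the all-ones string.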

  20. Monolithic microwave integrated circuits

    Science.gov (United States)

    Pucel, R. A.

    Monolithic microwave integrated circuits (MMICs), a new microwave technology expected to exert a profound influence on microwave circuit designs for future military systems as well as for the commercial and consumer markets, are discussed. The book contains a historical discussion followed by a comprehensive review presenting the current status of the field. The general topics of the volume are: design considerations; materials and processing considerations; monolithic circuit applications; and CAD, measurement, and packaging techniques. All phases of MMIC technology are covered, from design to testing.

  1. Processor Management in the Tera MTA Computer System,

    Science.gov (United States)

    1993-01-01

    This paper describes the processor scheduling issues specific to the Tera MTA (Multi Threaded Architecture) computer system and presents solutions to classic scheduling problems. The Tera MTA exploits parallelism at all levels, from fine-grained instruction-level parallelism within a single

  2. Fast 2D-DCT implementations for VLIW processors

    OpenAIRE

    Sohm, OP; Canagarajah, CN; Bull, DR

    1999-01-01

    This paper analyzes various fast 2D-DCT algorithms regarding their suitability for VLIW processors. Operations for truncation or rounding, which are usually neglected in proposals for fast algorithms, have also been taken into consideration. Loeffler's algorithm with parallel multiplications was found to be most suitable due to its parallel structure.

  3. Signal processor packaging design

    Science.gov (United States)

    McCarley, Paul L.; Phipps, Mickie A.

    1993-10-01

    The Signal Processor Packaging Design (SPPD) program was a technology development effort to demonstrate that a miniaturized, high throughput programmable processor could be fabricated to meet the stringent environment imposed by high speed kinetic energy guided interceptor and missile applications. This successful program culminated with the delivery of two very small processors, each about the size of a large pin grid array package. Rockwell International's Tactical Systems Division in Anaheim, California developed one of the processors, and the other was developed by Texas Instruments' (TI) Defense Systems and Electronics Group (DSEG) of Dallas, Texas. The SPPD program was sponsored by the Guided Interceptor Technology Branch of the Air Force Wright Laboratory's Armament Directorate (WL/MNSI) at Eglin AFB, Florida and funded by SDIO's Interceptor Technology Directorate (SDIO/TNC). These prototype processors were subjected to rigorous tests of their image processing capabilities, and both successfully demonstrated the ability to process 128 X 128 infrared images at a frame rate of over 100 Hz.

  4. APRON: A Cellular Processor Array Simulation and Hardware Design Tool

    Directory of Open Access Journals (Sweden)

    David R. W. Barr

    2009-01-01

    Full Text Available We present a software environment for the efficient simulation of cellular processor arrays (CPAs. This software (APRON is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.

  5. PERFORMANCE EVALUATION OF DIRECT PROCESSOR ACCESS FOR NON DEDICATED SERVER

    Directory of Open Access Journals (Sweden)

    P. S. BALAMURUGAN

    2010-10-01

    Full Text Available The objective of the paper is to design a coprocessor for a desktop machine which enables the machine to act as a non-dedicated server, such that the coprocessor acts as a server processor and the multi-core processor acts as the desktop processor. By implementing this methodology, a client machine can be made to act as both a non-dedicated server and a client machine. Machines of this type can be used in autonomous networks. This design leads to a cost-effective server: a machine which can act in parallel as a non-dedicated server and a client machine, or which can be switched to act as client or server.

  6. Issue Mechanism for Embedded Simultaneous Multithreading Processor

    Science.gov (United States)

    Zang, Chengjie; Imai, Shigeki; Frank, Steven; Kimura, Shinji

    Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to, or surpassing, that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate-register-based issue control is adopted and consecutive instructions with the parallel flag set to '0' are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an instruction queue skip mechanism for a thread whose queue is empty. Using this issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)), and thread one through thread n have the highest priority in turn (PR(T1-Tn)). The results show that the RR(T1-Tn) policy outperforms the others, and PR(T1-Tn) is almost the same as RR(T1-Tn) from the point of view of issued instructions per cycle.
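
    The round-robin issue order with the empty-queue skip mechanism can be sketched as follows (a behavioural illustration in software, not the processor's actual issue logic):

```python
from collections import deque

class RoundRobinIssue:
    """Round-robin thread selection that skips threads with empty instruction queues."""
    def __init__(self, num_threads):
        self.queues = [deque() for _ in range(num_threads)]
        self.next_thread = 0

    def push(self, tid, instr):
        """Fetch stage: append an instruction to thread tid's queue."""
        self.queues[tid].append(instr)

    def issue(self):
        """Issue from the next non-empty queue in round-robin order."""
        n = len(self.queues)
        for offset in range(n):                    # skip empty queues
            tid = (self.next_thread + offset) % n
            if self.queues[tid]:
                self.next_thread = (tid + 1) % n   # rotate priority past this thread
                return tid, self.queues[tid].popleft()
        return None                                # all queues empty
```

    With three threads and instructions only in queues 0 and 2, successive `issue()` calls alternate between those threads, skipping the empty queue 1 instead of stalling.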

  7. Embedded-monolith armor

    Science.gov (United States)

    McElfresh, Michael W.; Groves, Scott E; Moffet, Mitchell L.; Martin, Louis P.

    2016-07-19

    A lightweight armor system utilizing a face section having a multiplicity of monoliths embedded in a matrix supported on low density foam. The face section is supported with a strong stiff backing plate. The backing plate is mounted on a spall plate.

  8. The Milstar Advanced Processor

    Science.gov (United States)

    Tjia, Khiem-Hian; Heely, Stephen D.; Morphet, John P.; Wirick, Kevin S.

    The Milstar Advanced Processor (MAP) is a 'drop-in' replacement for its predecessor which preserves existing interfaces with other Milstar satellite processors and minimizes the impact of the upgrade on already-developed application software. In addition to flight software development, and hardware development that involves the application of VHSIC technology to the electrical design, the MAP project is developing two sophisticated and similar test environments. High density RAM and ROM are employed by the MAP memory array. Attention is given to the fine-pitch VHSIC design techniques and lead designs used, as well as the role of TQM and concurrent engineering in the development of the MAP manufacturing process.

  9. Parallel External Memory Graph Algorithms

    DEFF Research Database (Denmark)

    Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari

    2010-01-01

    In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking, which leads to efficient solutions to problems on trees, such as computing lowest common ancestors… The algorithms achieve an optimal speedup of Θ(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts…

  10. Heterogeneous Parallel Computing

    OpenAIRE

    2013-01-01

    With processor core counts doubling every 18-24 months and penetrating all markets, from high-end servers in supercomputers to desktops and laptops down to even mobile phones, we sit at the dawn of a world of ubiquitous parallelism, one where extracting performance via parallelism is paramount. That is, the "free lunch" of better performance, where programmers could rely on substantial increases in single-threaded performance to improve software, is over. The burden falls on developers to expl...

  11. Practical parallel programming

    CERN Document Server

    Bauer, Barr E

    2014-01-01

    This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.

  12. Interactive Digital Signal Processor

    Science.gov (United States)

    Mish, W. H.

    1985-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of digital signal time series to extract information is usually achieved by applying a number of fairly standard operations. IDSP is an excellent teaching tool for demonstrating the application of time series operators to artificially generated signals.

  13. Beyond processor sharing

    NARCIS (Netherlands)

    Aalto, S.; Ayesta, U.; Borst, S.C.; Misra, V.; Núñez Queija, R.

    2007-01-01

    While the (Egalitarian) Processor-Sharing (PS) discipline offers crucial insights into the performance of fair resource allocation mechanisms, it is inherently limited in analyzing and designing differentiated scheduling algorithms such as Weighted Fair Queueing and Weighted Round-Robin. The Discrimin
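
    A Weighted Round-Robin discipline of the kind this record contrasts with egalitarian PS can be sketched as follows (illustrative only): each queue is visited in turn and served up to its weight's worth of items per round.

```python
from collections import deque

def weighted_round_robin(queues, weights):
    """Serve each non-empty queue up to `weight` items per round (WRR order)."""
    order = []
    queues = [deque(q) for q in queues]
    while any(queues):
        for q, w in zip(queues, weights):
            # A queue with weight w receives up to w service slots per round.
            for _ in range(w):
                if q:
                    order.append(q.popleft())
    return order
```

    With queues ['a1', 'a2', 'a3'] and ['b1'] and weights 2 and 1, the first round serves a1, a2, b1 and the second round serves a3, so class A receives twice the service rate of class B while both classes keep making progress.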

  14. The Central Trigger Processor (CTP)

    CERN Multimedia

    Franchini, Matteo

    2016-01-01

    The Central Trigger Processor (CTP) receives trigger information from the calorimeter and muon trigger processors, as well as from other trigger sources. It makes the Level-1 decision (L1A) based on a trigger menu.

  15. A Domain Specific DSP Processor

    OpenAIRE

    Tell, Eric

    2001-01-01

    This thesis describes the design of a domain specific DSP processor. The thesis is divided into two parts. The first part gives some theoretical background, describes the different steps of the design process (both for DSP processors in general and for this project) and motivates the design decisions made for this processor. The second part is a nearly complete design specification. The intended use of the processor is as a platform for hardware acceleration units. Support for this has howe...

  16. Processor register error correction management

    Science.gov (United States)

    Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.

    2016-12-27

    Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.

  17. Dual-core Itanium Processor

    CERN Multimedia

    2006-01-01

    Intel's first dual-core Itanium processor, code-named "Montecito", is a major release of Intel's Itanium 2 Processor Family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor, with lower power consumption and thermal dissipation.

  18. Monolithic MACS micro resonators

    Science.gov (United States)

    Lehmann-Horn, J. A.; Jacquinot, J.-F.; Ginefri, J. C.; Bonhomme, C.; Sakellariou, D.

    2016-10-01

    Magic Angle Coil Spinning (MACS) helps improve the intrinsically low NMR sensitivity of heterogeneous microscopic samples. We report on the design and testing of a new type of monolithic 2D MACS resonator that overcomes known limitations of conventional micro coils. The resonators' conductors were printed on a dielectric substrate and tuned without using lumped-element capacitors. Self-resonance conditions were computed by a hybrid FEM-MoM technique. Preliminary results reported here indicate robust mechanical stability, reduced eddy-current heating and negligible susceptibility effects. The gain in B1/√P is in agreement with the NMR sensitivity enhancement according to the principle of reciprocity. A sensitivity enhancement larger than 3 has been achieved with a monolithic micro resonator inside a standard 4 mm rotor at 500 MHz. These 2D resonators could offer higher-performance micro-detection and ease of use for heterogeneous microscopic substances such as biomedical samples, microscopic specimens and thin film materials.

  19. The MONOLITH prototype

    CERN Document Server

    Ambrosio, M; Bencivenni, G; Candela, A M; Chiarini, A; Chignoli, F; De Deo, M; D'Incecco, M; Gerli, S; Giusti, P; Gómez, F; Gustavino, C; Lindozzi, M; Mannocchi, G; Menghetti, H; Morello, C; Murtas, F; Paoluzzi, G; Pilastrini, R; Redaelli, N G; Santoni, M; Sartorelli, G; Terranova, F; Trinchero, G C

    2000-01-01

    MONOLITH (Massive Observatory for Neutrino Oscillation or LImits on THeir existence) is the project of an experiment to study atmospheric neutrino oscillations with a massive magnetized iron detector. The baseline option is a 34 kt iron detector based on the use of about 50000 m² of the glass Resistive Plate Chambers (glass RPCs) developed at the Laboratori Nazionali del Gran Sasso (LNGS). An 8 ton prototype equipped with 23 m² of glass RPC has been realized and tested at the T7-PS beam at CERN. The energy resolution for pions follows a 68%/√E(GeV) + 2% law for orthogonally incident particles, in the energy range between 2 and 10 GeV. The time resolution and the tracking capability of the glass RPC are suitable for the MONOLITH experiment. (7 refs).

  20. New Generation Processor Architecture Research

    Institute of Scientific and Technical Information of China (English)

    Chen Hongsong(陈红松); Hu Mingzeng; Ji Zhenzhou

    2003-01-01

    With the rapid development of microelectronics and hardware, ever faster microprocessors and new architectures will be needed to meet tomorrow's computing needs. New processor microarchitectures are needed to push performance further and to use higher transistor counts effectively. At the same time, aiming at different usages, processors have been optimized in different aspects, such as high performance, low power consumption, small chip area and high security. SOC (System on Chip) and SCMP (Single Chip Multi-Processor) constitute the main processor system architectures.

  1. Stereoscopic Optical Signal Processor

    Science.gov (United States)

    Graig, Glenn D.

    1988-01-01

    Optical signal processor produces two-dimensional cross correlation of images from stereoscopic video camera in real time. Cross correlation used to identify object, determine distance, or measure movement. Left and right cameras modulate beams from light source for correlation in video detector. Switch in position 1 produces information about range of object viewed by cameras. Position 2 gives information about movement. Position 3 helps to identify object.

  2. Conversion via software of a simd processor into a mimd processor

    Energy Technology Data Exchange (ETDEWEB)

    Guzman, A.; Gerzso, M.; Norkin, K.B.; Vilenkin, S.Y.

    1983-01-01

    A method is described which takes a pure LISP program and automatically decomposes it via automatic parallelization into several parts, one for each processor of an SIMD architecture. Each of these parts is a different execution flow, i.e., a different program. The execution of these different programs by an SIMD architecture is examined. The method has been developed in some detail for the PS-2000, an SIMD Soviet multiprocessor, making it behave like AHR, a Mexican MIMD multi-microprocessor. Both the PS-2000 and AHR execute a pure LISP program in parallel; its decomposition into n pieces, their synchronization, scheduling, etc., are performed by the system (hardware and software). In order to achieve simultaneous execution of different programs in an SIMD processor, the method uses a scheme of node scheduling and node exportation. 14 references.

  3. Systematic approach for deriving feasible mappings of parallel algorithms to parallel computing platforms

    NARCIS (Netherlands)

    Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.

    2017-01-01

    The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed

  4. Systematic approach for deriving feasible mappings of parallel algorithms to parallel computing platforms

    NARCIS (Netherlands)

    Arkin, Ethem; Tekinerdogan, Bedir; Imre, Kayhan M.

    2016-01-01

    The need for high-performance computing together with the increasing trend from single processor to parallel computer architectures has leveraged the adoption of parallel computing. To benefit from parallel computing power, usually parallel algorithms are defined that can be mapped and executed o

  5. ADAPTATION OF PARALLEL VIRTUAL MACHINES MECHANISMS TO PARALLEL SYSTEMS

    Directory of Open Access Journals (Sweden)

    Zafer DEMİR

    2001-02-01

    Full Text Available In this study, the Parallel Virtual Machine (PVM) is first reviewed. Since it is based upon parallel processing, it is similar in principle to parallel systems in terms of architecture. The Parallel Virtual Machine is neither an operating system nor a programming language; it is a specific software tool that supports heterogeneous parallel systems, and it takes advantage of the features of both to bring users closer to parallel systems. Since tasks can be executed in parallel on parallel systems by the Parallel Virtual Machine, there is an important similarity between PVM, distributed systems, and multiprocessors. In this study, these relations are examined by making use of the master-slave programming technique. In conclusion, PVM is tested with a simple factorial computation on a distributed system to observe its adaptation to parallel architectures.
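
    The factorial test mentioned above follows the master-slave pattern: the master partitions the range 1..n among slaves and combines their partial products. PVM itself is a C/Fortran message-passing library; the sketch below mimics the same decomposition with Python threads (illustrative, not PVM code):

```python
import threading
from math import prod

def parallel_factorial(n, num_workers=4):
    """Master splits 1..n into strided chunks; each 'slave' computes a partial product."""
    partial = [1] * num_workers

    def slave(wid):
        # Worker wid multiplies the numbers wid+1, wid+1+num_workers, ...
        p = 1
        for k in range(wid + 1, n + 1, num_workers):
            p *= k
        partial[wid] = p

    threads = [threading.Thread(target=slave, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return prod(partial)   # master combines the partial results
```

    For instance, `parallel_factorial(10)` computes 10! = 3628800 with the multiplications spread over four worker threads.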

  6. Bioaffinity chromatography on monolithic supports

    NARCIS (Netherlands)

    Tetala, K.K.R.; Beek, van T.A.

    2010-01-01

    Affinity chromatography on monolithic supports is a powerful analytical chemical platform because it allows for fast analyses, small sample volumes, strong enrichment of trace biomarkers and applications in microchips. In this review, the recent research using monolithic materials in the field of bi

  8. Distributed processor allocation for launching applications in a massively connected processors complex

    Science.gov (United States)

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  9. Design of monoliths through their mechanical properties.

    Science.gov (United States)

    Podgornik, Aleš; Savnik, Aleš; Jančar, Janez; Krajnc, Nika Lendero

    2014-03-14

    Chromatographic monoliths have several interesting properties making them attractive supports for analytics, but also for purification, especially of large biomolecules and bioassemblies. Although many monolith features have been thoroughly investigated, no data are available to predict how a monolith's mechanical properties affect its chromatographic performance. In this work, we investigated the effect of porosity, pore size and chemical modification on the compression modulus of methacrylate monoliths. While a linear correlation between pore size and compression modulus was found, the effect of porosity was highly exponential. From these correlations it was concluded that chemical modification affects monolith porosity without changing the integrity of the monolith skeleton. A mathematical model describing the change of monolith permeability as a function of the compression modulus was derived and successfully validated for monoliths of different geometries and pore sizes. It enables the prediction of the pressure drop increase due to monolith compressibility for any structural characteristics, such as geometry, porosity and pore size, and for mobile phase properties like viscosity and flow rate, based solely on the compression modulus and the structural data of the non-compressed monolith. Furthermore, it enables simple determination of the pore size at which monolith compressibility is smallest and the most robust performance is expected. Data on the monolith compression modulus, in combination with the developed mathematical model, can therefore be used to predict monolith permeability during implementation, but also to accelerate the design of novel chromatographic monoliths with the desired hydrodynamic properties for a particular application.

  10. Introducing the PilGRIM: A Processor for Executing Lazy Functional Languages

    NARCIS (Netherlands)

    Boeijink, Arjan; Hölzenspies, Philip K.F.; Kuper, Jan; Hagen, Jurriaan; Morazán, Marco T.

    2011-01-01

    Processor designs specialized for functional languages have received very little attention in the past 20 years. The potential for exploiting more parallelism and the developments in hardware technology call for renewed investigation of this topic. In this paper, we use ideas from modern processor archit

  11. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven

    2009-01-01

    The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor systems-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism, and further extraction of parallelism is limited by small data sets and a relatively high parallelization overhead. Load balance is difficult to obtain due to the limited parallelism and is made worse by non-uniform memory latency. Three parallel OpenMP implementations of the application…
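
    The row-wise parallelization pattern discussed in this record can be shown on an assumed thresholding kernel (not the paper's application): rows are distributed across threads in a strided fashion, a crude way to limit load imbalance when rows cost different amounts.

```python
import threading

def threshold_rows(image, cutoff, num_threads=4):
    """Binarize an image (list of rows) by splitting the rows across threads."""
    result = [None] * len(image)

    def work(start, step):
        # Strided row assignment: thread i handles rows i, i+step, i+2*step, ...
        for r in range(start, len(image), step):
            result[r] = [1 if px >= cutoff else 0 for px in image[r]]

    threads = [threading.Thread(target=work, args=(i, num_threads))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

    Because each row is written by exactly one thread, no locking is needed; the strided assignment interleaves cheap and expensive rows across threads, which is one simple answer to the load-imbalance problem the paper describes.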

  12. Parallel optimization of multi-view deblurring algorithm in dual-core digital signal processor%多视点去模糊算法在双核DSP上的并行优化

    Institute of Scientific and Technical Information of China (English)

    付航; 章秀华; 贺武

    2015-01-01

    To realize fast running of the multi-view deblurring algorithm on small devices, a parallel optimization algorithm was proposed. A TMS320C6657 dual-core digital signal processor (DSP) was used as the primary computing chip, and CCSv5.2 was used as the software development environment. To address the long single-core running time, the time spent in each sub-function of the algorithm was analyzed using the time stamp counter. The matrix multiplication that consumes the longest time among the sub-functions was then optimized by splitting it at a dividing point into two parts, assigning equal amounts of computation to the two cores of the DSP so that this part of the algorithm runs on both cores in parallel. The results show that the solving method for the dividing point is correct and effective; the optimized multi-view deblurring algorithm greatly reduces the running time on the DSP and improves operational efficiency.
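
    The dividing-point scheme described above, splitting the rows of a matrix product between two cores, can be sketched with two host threads (a structural illustration; the actual implementation targets the TMS320C6657's two DSP cores):

```python
import threading

def matmul_rows(A, B, rows, C):
    """Compute the given rows of C = A x B with a plain triple loop."""
    n = len(B[0])
    k = len(B)
    for i in rows:
        C[i] = [sum(A[i][t] * B[t][j] for t in range(k)) for j in range(n)]

def dual_core_matmul(A, B, split=None):
    """Split the row range of C at a dividing point and run two threads."""
    m = len(A)
    split = m // 2 if split is None else split   # balanced dividing point
    C = [None] * m
    # "Core 1" computes rows [0, split), "core 2" computes rows [split, m).
    t1 = threading.Thread(target=matmul_rows, args=(A, B, range(0, split), C))
    t2 = threading.Thread(target=matmul_rows, args=(A, B, range(split, m), C))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return C
```

    Each thread writes a disjoint set of rows of C, so the two halves need no synchronization beyond the final join; choosing the dividing point so both halves cost the same is exactly the balance problem the paper solves.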

  13. Porous polymer monolithic col

    Directory of Open Access Journals (Sweden)

    Lydia Terborg

    2015-05-01

    Full Text Available A new approach has been developed for the preparation of mixed-mode stationary phases to separate proteins. The pore surface of monolithic poly(glycidyl methacrylate-co-ethylene dimethacrylate) capillary columns was functionalized with thiols and coated with gold nanoparticles. The final mixed-mode surface chemistry was formed by attaching, in a single step, alkanethiols, mercaptoalkanoic acids, and their mixtures to the free surface of the attached gold nanoparticles. Use of these mixtures allowed fine tuning of the hydrophobic/hydrophilic balance. The amount of attached gold nanoparticles according to thermal gravimetric analysis was 44.8 wt.%. This value, together with results of frontal elution, enabled calculation of the surface coverage with the alkanethiol and mercaptoalkanoic acid ligands. Interestingly, alkanethiol coverage in a range of 4.46–4.51 molecules/nm² significantly exceeded that of the mercaptoalkanoic acids with 2.39–2.45 molecules/nm². The mixed-mode character of these monolithic stationary phases was demonstrated for the first time in separations of proteins that could be achieved in the same column using gradient elution conditions typical of reversed-phase (a gradient of acetonitrile in water) and ion exchange (a gradient of salt in water) chromatographic modes, respectively.

  14. AMD's 64-bit Opteron processor

    CERN Document Server

    CERN. Geneva

    2003-01-01

    This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included. Biographies: David Rich. David directs AMD's efforts in high performance computing and also in the use of Opteron processors...

  15. Emerging Trends in Embedded Processors

    Directory of Open Access Journals (Sweden)

    Gurvinder Singh

    2014-05-01

    Full Text Available An embedded processor is simply a µprocessor that has been "embedded" into a device. Embedded systems are an important part of human life. For illustration, one cannot visualize life without mobile phones for personal communication. Embedded systems are used in many places, such as healthcare, automotive, daily life, and in different offices and industries. Embedded processors have opened a new research area in the field of hardware design.

  16. Adapting algorithms to massively parallel hardware

    CERN Document Server

    Sioulas, Panagiotis

    2016-01-01

    In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.
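
    One of the two problems studied, the k-d tree search, can be illustrated with a minimal serial version; the project's massively parallel GPU/MPPA adaptation is not reproduced here, and the point set and helper names are invented for the sketch:

```python
import math

def build_kdtree(points, depth=0):
    # Recursively split points on alternating axes (2-D here).
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest(node, target, depth=0, best=None):
    # Depth-first search, pruning the far branch when the split plane
    # is further away than the best candidate found so far.
    if node is None:
        return best
    if best is None or dist(node["point"], target) < dist(best, target):
        best = node["point"]
    axis = depth % 2
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, depth + 1, best)
    if abs(diff) < dist(best, target):  # the far branch may still hold a closer point
        best = nearest(far, target, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
print(nearest(tree, (9, 2)))  # -> (8, 1)
```

    The parallel adaptation amounts to running many such independent queries concurrently, one per GPU thread.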

  17. Monolitni katalizatori i reaktori: osnovne značajke, priprava i primjena (Monolith catalysts and reactors: preparation and applications)

    Directory of Open Access Journals (Sweden)

    Tomašić, V.

    2004-12-01

    Full Text Available Monolithic (honeycomb) catalysts are continuous unitary structures containing many narrow, parallel and usually straight channels (or passages). Catalytically active components are dispersed uniformly over the whole porous ceramic monolith structure (so-called incorporated monolithic catalysts) or are in a layer of porous material that is deposited on the walls of channels in the monolith's structure (washcoated monolithic catalysts). The material of the main monolithic construction is not limited to ceramics but includes metals as well. Monolithic catalysts are commonly used in gas phase catalytic processes, such as treatment of automotive exhaust gases, selective catalytic reduction of nitrogen oxides, catalytic removal of volatile organic compounds from industrial processes, etc. Monoliths continue to be the preferred support for environmental applications due to their high geometric surface area, different design options, low pressure drop, high temperature durability, mechanical strength, ease of orientation in a reactor and effectiveness as a support for a catalytic washcoat. As is known, monolithic catalysts belong to the class of structured catalysts and/or reactors (in some cases the distinction between "catalyst" and "reactor" has vanished). Structured catalysts can greatly intensify chemical processes, resulting in smaller, safer, cleaner and more energy efficient technologies. Monolith reactors can be considered as multifunctional reactors, in which chemical conversion is advantageously integrated with another unit operation, such as separation, heat exchange, a secondary reaction, etc. Finally, structured catalysts and/or reactors appear to be one of the most significant and promising developments in the field of heterogeneous catalysis and chemical engineering of recent years. This paper gives a description of the background and perspectives for application and development of monolithic materials. Different methods and techniques...

  18. All-optical digital processor based on harmonic generation phenomena

    Science.gov (United States)

    Shcherbakov, Alexandre S.; Rakovsky, Vsevolod Y.

    1990-07-01

    Digital optical processors are designed to combine the ultra-parallel data processing capabilities of optical systems and the high accuracy of performed computations. The ultimate limit of the processing rate can be anticipated from all-optical parallel architectures based on networks of logic gates using materials exhibiting strong electronic nonlinearities with response times of less than 10 seconds [1].

  19. Effective Utilization of Multicore Processor for Unified Threat Management Functions

    Directory of Open Access Journals (Sweden)

    Radhakrishnan Shanmugasundaram

    2012-01-01

    Full Text Available Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Unified threat management is one such application that has multiple functions to be implemented at high speed. Increasing system performance by knowing the nature of the functionality, and effectively allocating multiple processors to each of the functions, warrants detailed experimentation. In this study, some of the functions of unified threat management are implemented using multiple processors for each function. Approach: This evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. OpenMP parallelization methods are used for scheduling the logical CPUs for the parallelized application. Results: Execution times for some of the implemented UTM functions were analyzed to arrive at an effective allocation and parallelization methodology that depends on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, the type of parallelization method for the implemented UTM functions is suggested.

  20. ADVANCES AT A GLANCE IN PARALLEL COMPUTING

    Directory of Open Access Journals (Sweden)

    RAJKUMAR SHARMA

    2014-07-01

    Full Text Available In the history of computational world, sequential uni-processor computers have been exploited for years to solve scientific and business problems. To satisfy the demand of compute & data hungry applications, it was observed that better response time can be achieved only through parallelism. Large computational problems were partitioned and solved by using multiple CPUs in parallel. Computing performance was further improved by adopting multi-core architecture which provides hardware parallelism through use of multiple cores. Efficient resource utilization of a parallel computing environment by using software and hardware parallelism is a major research challenge. The present hardware technologies provide freedom to algorithm developers for control & management of resources through software codes, such as threads-to-cores mapping in recent multi-core processors. In this paper, a survey is presented since beginning of parallel computing up to the use of present state-of-art multi-core processors.

  1. Biobased monoliths for adenovirus purification.

    Science.gov (United States)

    Fernandes, Cláudia S M; Gonçalves, Bianca; Sousa, Margarida; Martins, Duarte L; Barroso, Telma; Pina, Ana Sofia; Peixoto, Cristina; Aguiar-Ricardo, Ana; Roque, A Cecília A

    2015-04-01

    Adenoviruses are important platforms for vaccine development and vectors for gene therapy, increasing the demand for high titers of purified viral preparations. Monoliths are macroporous supports regarded as ideal for the purification of macromolecular complexes, including viral particles. Although common monoliths are based on synthetic polymers as methacrylates, we explored the potential of biopolymers processed by clean technologies to produce monoliths for adenovirus purification. Such an approach enables the development of disposable and biodegradable matrices for bioprocessing. A total of 20 monoliths were produced from different biopolymers (chitosan, agarose, and dextran), employing two distinct temperatures during the freezing process (-20 °C and -80 °C). The morphological and physical properties of the structures were thoroughly characterized. The monoliths presenting higher robustness and permeability rates were further analyzed for the nonspecific binding of Adenovirus serotype 5 (Ad5) preparations. The matrices presenting lower nonspecific Ad5 binding were further functionalized with quaternary amine anion-exchange ligand glycidyltrimethylammonium chloride hydrochloride by two distinct methods, and their performance toward Ad5 purification was assessed. The monolith composed of chitosan and poly(vinyl) alcohol (50:50) prepared at -80 °C allowed 100% recovery of Ad5 particles bound to the support. This is the first report of the successful purification of adenovirus using monoliths obtained from biopolymers processed by clean technologies.

  2. Monolithic microchannel heatsink

    Science.gov (United States)

    Benett, William J.; Beach, Raymond J.; Ciarlo, Dino R.

    1996-01-01

    A silicon wafer has slots sawn in it that allow diode laser bars to be mounted in contact with the silicon. Microchannels are etched into the back of the wafer to provide cooling of the diode bars. To facilitate getting the channels close to the diode bars, the channels are rotated from an angle perpendicular to the diode bars which allows increased penetration between the mounted diode bars. This invention enables the fabrication of monolithic silicon microchannel heatsinks for laser diodes. The heatsinks have low thermal resistance because of the close proximity of the microchannels to the laser diode being cooled. This allows high average power operation of two-dimensional laser diode arrays that have a high density of laser diode bars and therefore high optical power density.

  3. Algorithmically specialized parallel computers

    CERN Document Server

    Snyder, Lawrence; Gannon, Dennis B

    1985-01-01

    Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster

  4. A novel VLSI processor architecture for supercomputing arrays

    Science.gov (United States)

    Venkateswaran, N.; Pattabiraman, S.; Devanathan, R.; Ahmed, Ashaf; Venkataraman, S.; Ganesh, N.

    1993-01-01

    Design of the processor element for general purpose massively parallel supercomputing arrays is highly complex and cost-ineffective. To overcome this, the architecture and organization of the functional units of the processor element should be such as to suit the diverse computational structures and simplify mapping of the complex communication structures of different classes of algorithms. This demands that the computation and communication structures of different classes of algorithms be unified. While unifying the different communication structures is a difficult process, analysis of a wide class of algorithms reveals that their computation structures can be expressed in terms of basic IP,IP,OP,CM,R,SM, and MAA operations. The execution of these operations is unified on the PAcube macro-cell array. Based on this PAcube macro-cell array, we present a novel processor element called the GIPOP processor, which has dedicated functional units to perform the above operations. The architecture and organization of these functional units are such as to satisfy the two important criteria mentioned above. The structure of the macro-cell and the unification process have led to a very regular and simple design of the GIPOP processor. The production cost of the GIPOP processor is drastically reduced as it is designed on high performance mask-programmable PAcube arrays.

  5. Design and Implementation of Quintuple Processor Architecture Using FPGA

    Directory of Open Access Journals (Sweden)

    P.Annapurna

    2014-09-01

    Full Text Available The advanced quintuple processor core is a design philosophy that has become mainstream in scientific and engineering applications. The increasing performance and gate capacity of recent FPGA devices permit complex logic systems to be implemented on a single programmable device. Embedded multiprocessors face a new problem with thread synchronization, caused by the distributed memory: when thread synchronization is violated, the processors can access the same value at the same time. Processor performance can basically be increased by adopting clock scaling techniques and microarchitectural enhancements. Therefore, a new architecture called Advanced Concurrent Computing was designed and implemented on an FPGA chip using VHDL. The Advanced Concurrent Computing architecture makes simultaneous use of both parallel and distributed computing. The full architecture of the quintuple processor core is designed to perform arithmetic, logical, shifting and bit-manipulation operations. The proposed advanced quintuple processor core contains homogeneous RISC processors, augmented with pipelined processing units, a multi-bus organization and I/O ports, along with the other functional elements required to implement embedded SOC solutions. Performance issues of the designed quintuple core, such as area, speed, power dissipation and propagation delay, are analyzed at 90nm process technology using the Xilinx tool.

  6. Parallel Genetic Algorithm for Alpha Spectra Fitting

    Science.gov (United States)

    García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio

    2005-01-01

    We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. GA is a resource-demanding method, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors to use in a simulation.
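
    The two-step scheme (global GA search, then local refinement) can be sketched in miniature. Everything below is illustrative: a toy one-parameter objective stands in for the spectrum chi-square, and plain gradient descent stands in for Levenberg-Marquardt:

```python
import random

random.seed(42)

def f(x):
    # Toy objective standing in for the chi-square of a spectrum fit;
    # the true minimum is at x = 3.2.
    return (x - 3.2) ** 2 + 0.5

def ga_step(pop):
    # Keep the best half, refill with mutated copies (a tiny GA generation).
    pop = sorted(pop, key=f)[: len(pop) // 2]
    return pop + [p + random.gauss(0, 0.5) for p in pop]

def two_step_fit(generations=30):
    pop = [random.uniform(-10, 10) for _ in range(20)]
    for _ in range(generations):          # step 1: global GA search
        pop = ga_step(pop)
    x = min(pop, key=f)
    for _ in range(100):                  # step 2: local refinement
        grad = 2 * (x - 3.2)              # analytic gradient of the toy f
        x -= 0.1 * grad
    return x

x = two_step_fit()
print(round(x, 3))
```

    In the paper only step 1 is parallelized (across the Beowulf cluster); the local refinement is cheap and stays serial.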

  7. Remarks on parallel computations in MATLAB environment

    Science.gov (United States)

    Opalska, Katarzyna; Opalski, Leszek

    2013-10-01

    The paper attempts to summarize the author's investigation of the parallel computation capability of the MATLAB environment in solving large systems of ordinary differential equations (ODEs). Two MATLAB versions were tested along with two parallelization techniques: one used multiple processor cores, the other CUDA-compatible Graphics Processing Units (GPUs). A set of parameterized test problems was specially designed to expose different capabilities/limitations of the different variants of the parallel computation environment tested. The presented results illustrate clearly the superiority of the newer MATLAB version and the elapsed-time advantage of GPU-parallelized computations over multiple processor cores for large-dimensionality problems (with a speed-up factor strongly dependent on the problem structure).

  8. Embedded Processor Oriented Compiler Infrastructure

    Directory of Open Access Journals (Sweden)

    DJUKIC, M.

    2014-08-01

    Full Text Available In recent years, research into special compiler techniques and algorithms for embedded processors has broadened the knowledge of how to achieve better compiler performance on irregular processor architectures. However, industrial-strength compilers, besides the ability to generate efficient code, must also be robust, understandable, maintainable, and extensible. This raises the need for compiler infrastructure that provides means for convenient implementation of embedded-processor-oriented compiler techniques. The Cirrus Logic Coyote 32 DSP is an example that shows how traditional compiler infrastructure is not able to cope with the problem. That is why a new compiler infrastructure was developed for this processor, based on research in the field of embedded system software tools and experience in the development of industrial-strength compilers. The new infrastructure is described in this paper. Compiler-generated code quality is compared with code generated by the previous compiler for the same processor architecture.

  9. Multi Microkernel Operating Systems for Multi-Core Processors

    Directory of Open Access Journals (Sweden)

    Rami Matarneh

    2009-01-01

    Full Text Available Problem statement: In the midst of the huge development in the processor industry in response to the increasing demand for high-speed processors, manufacturers were able to achieve the goal of producing the required processors, but the industry faced problems not amenable to solution, such as complexity, hard management and large consumption of energy. These problems forced the manufacturers to stop focusing on increasing the speed of processors and to move toward parallel processing to increase performance. This eventually produced multi-core processors with high performance, if used properly. Unfortunately, until now, these processors have not been used as they should be, because of the lack of support from operating systems and software applications. Approach: The approach was based on the assumption that a single-kernel operating system is not enough to manage multi-core processors, and on rethinking the construction of the operating system as a multi-kernel one. One of these kernels serves as the master kernel and the others serve as slave kernels. Results: Theoretically, the proposed model showed that it can do much better than the existing models, because it supports single-threaded and multi-threaded processing at the same time; in addition, it can make better use of multi-core processors because it divides the load almost equally between the cores and the kernels, which leads to a significant improvement in the performance of the operating system. Conclusion: The software industry needs to get out of the classical framework to be able to keep pace with hardware development. This objective was achieved by rethinking the building of operating systems and software with new, innovative methodologies and methods, as the current theories of operating systems are no longer capable of achieving future aspirations.

  10. Matrix preconditioning: a robust operation for optical linear algebra processors.

    Science.gov (United States)

    Ghosh, A; Paparao, P

    1987-07-15

    Analog electrooptical processors are best suited for applications demanding high computational throughput with tolerance for inaccuracies. Matrix preconditioning is one such application. Matrix preconditioning is a preprocessing step for reducing the condition number of a matrix and is used extensively with gradient algorithms for increasing the rate of convergence and improving the accuracy of the solution. In this paper, we describe a simple parallel algorithm for matrix preconditioning, which can be implemented efficiently on a pipelined optical linear algebra processor. From the results of our numerical experiments we show that the efficacy of the preconditioning algorithm is affected very little by the errors of the optical system.
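
    Diagonal (Jacobi) preconditioning is the simplest instance of the idea: rescaling the system so its condition number drops, which speeds up gradient iterations. A small numeric sketch on an assumed diagonal system (not the optical implementation described in the paper):

```python
def solve_gd(A_diag, b, precond=False, tol=1e-8, max_iter=10000):
    # Gradient-style iteration on the diagonal system A x = b.
    # With Jacobi preconditioning each residual component is rescaled by
    # 1/A_ii, so the preconditioned system has condition number 1.
    n = len(b)
    x = [0.0] * n
    for it in range(1, max_iter + 1):
        r = [b[i] - A_diag[i] * x[i] for i in range(n)]     # residual
        if max(abs(v) for v in r) < tol:
            return x, it
        if precond:
            x = [x[i] + r[i] / A_diag[i] for i in range(n)]  # M^{-1} r step
        else:
            lr = 1.0 / max(A_diag)                           # stable step size
            x = [x[i] + lr * r[i] for i in range(n)]
    return x, max_iter

A_diag, b = [1.0, 100.0], [1.0, 100.0]   # condition number 100
_, plain_iters = solve_gd(A_diag, b)
_, pre_iters = solve_gd(A_diag, b, precond=True)
print(plain_iters, pre_iters)  # preconditioning needs far fewer iterations
```

    The optical processor in the paper performs this rescaling (a matrix-vector product) in its analog pipeline, where the noted error tolerance of the method matters.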

  11. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven

    2008-01-01

    The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor systems-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism. It is difficult to further extract parallelism since the application has small data sets and parallelization overhead is relatively high. There is also a fair amount of load imbalance which is made worse by a non-uniform memory latency. Even so, we show that with some tuning relative...

  12. Highly scalable linear solvers on thousands of processors.

    Energy Technology Data Exchange (ETDEWEB)

    Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

    2009-09-01

    In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

  13. Classical MD calculations with parallel computers

    Energy Technology Data Exchange (ETDEWEB)

    Matsumoto, Mitsuhiro [Nagoya Univ. (Japan)

    1998-03-01

    We have developed parallel computation codes for a classical molecular dynamics (MD) method. In order to use them on work station clusters as well as parallel super computers, we use MPI (message passing interface) library for distributed-memory type computers. Two algorithms are compared: (1) particle parallelism technique: easy to install, effective for rather small number of processors. (2) region parallelism technique: take some time to install, effective even for many nodes. (J.P.N.)
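
    The "particle parallelism" technique can be sketched without MPI: each worker owns a fixed subset of particles and computes the forces on them against all others. The toy linear force law and the thread pool below are stand-ins chosen only to show the data decomposition:

```python
from concurrent.futures import ThreadPoolExecutor

def forces_on(subset, positions):
    # Each worker computes the total pairwise force on its own particles
    # against all others (particle decomposition, 1-D toy attraction).
    out = {}
    for i in subset:
        f = 0.0
        for j, q in enumerate(positions):
            if j != i:
                f += q - positions[i]
        out[i] = f
    return out

def parallel_forces(positions, workers=2):
    ids = list(range(len(positions)))
    chunks = [ids[w::workers] for w in range(workers)]   # round-robin ownership
    result = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(lambda c: forces_on(c, positions), chunks):
            result.update(part)
    return [result[i] for i in range(len(positions))]

pos = [0.0, 1.0, 3.0]
print(parallel_forces(pos))
```

    Region parallelism, the paper's second algorithm, would instead assign each worker a spatial slab and exchange boundary particles between neighbours, which is why it takes more effort to install.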

  14. Parallel Binomial American Option Pricing with (and without) Transaction Costs

    CERN Document Server

    Zhang, Nan; Zastawniak, Tomasz

    2011-01-01

    We present a parallel algorithm that computes the ask and bid prices of an American option when proportional transaction costs apply to the trading of the underlying asset. The algorithm computes the prices on recombining binomial trees, and is designed for modern multi-core processors. Although parallel option pricing has been well studied, none of the existing approaches takes transaction costs into consideration. The algorithm that we propose partitions a binomial tree into blocks. In any round of computation a block is further partitioned into regions which are assigned to distinct processors. To minimise load imbalance the assignment of nodes to processors is dynamically adjusted before each new round starts. Synchronisation is required both within a round and between two successive rounds. The parallel speedup of the algorithm is proportional to the number of processors used. The parallel algorithm was implemented in C/C++ via POSIX Threads, and was tested on a machine with 8 processors. In the pricing ...
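
    The data structure being parallelized above is a recombining binomial tree evaluated by backward induction. A plain serial Cox-Ross-Rubinstein pricer for an American put (without the paper's block partitioning or transaction costs; the parameters are illustrative):

```python
import math

def american_put(S0, K, r, sigma, T, steps):
    # Cox-Ross-Rubinstein recombining binomial tree.
    dt = T / steps
    u = math.exp(sigma * math.sqrt(dt))
    d = 1.0 / u
    p = (math.exp(r * dt) - d) / (u - d)      # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Option values at maturity (level 'steps' of the tree).
    values = [max(K - S0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]
    # Backward induction; the early-exercise check makes the option American.
    for n in range(steps - 1, -1, -1):
        for j in range(n + 1):
            cont = disc * (p * values[j + 1] + (1 - p) * values[j])
            exercise = max(K - S0 * u**j * d**(n - j), 0.0)
            values[j] = max(cont, exercise)
    return values[0]

price = american_put(S0=100, K=100, r=0.05, sigma=0.2, T=1.0, steps=500)
print(round(price, 2))
```

    The parallel algorithm in the paper partitions each level of this tree into blocks and regions assigned to distinct processors, with synchronisation between levels.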

  15. A high-speed digital signal processor for atmospheric radar, part 7.3A

    Science.gov (United States)

    Brosnahan, J. W.; Woodard, D. M.

    1984-01-01

    The Model SP-320 device is a monolithic realization of a complex general purpose signal processor, incorporating such features as a 32-bit ALU, a 16-bit x 16-bit combinatorial multiplier, and a 16-bit barrel shifter. The SP-320 is designed to operate as a slave processor to a host general purpose computer in applications such as coherent integration of a radar return signal in multiple ranges, or dedicated FFT processing. Presently available is an I/O module conforming to the Intel Multichannel interface standard; other I/O modules will be designed to meet specific user requirements. The main processor board includes input and output FIFO (First In First Out) memories, both with depths of 4096 words, to permit asynchronous operation between the source of data and the host computer. This design permits burst data rates in excess of 5 Mwords/s.

  16. A radiation hard bipolar monolithic front-end readout

    CERN Document Server

    Baschirotto, A; Cappelluti, I; Castello, R; Cermesoni, M; Gola, A; Pessina, G; Pistolesi, E; Rancoita, P G; Seidman, A

    1999-01-01

    A fast bipolar monolithic charge sensitive preamplifier (CSP), implemented in the monolithic 2 mu m BiCMOS technology (called HF2CMOS), was designed and built in a quad monolithic chip. Studies of radiation effects on the CSP performance, from non-irradiated and up to neutron irradiation of 5.3*10/sup 14/ n/cm/sup 2/, have confirmed that the use of bipolar npn transistors is suitable for the radiation level of the future LHC collider environment. The CSP presents a new circuit solution for obtaining adequate slew rate performance, which results in an integral linearity better than 0.85% up to 5 V at 20 ns of shaping time, regardless of the bias current selected for the CSP. This way the bias current of the CSP can be set to optimize the power dissipation with respect to series and parallel noise, especially useful when the CSP is put in a radiation environment. A prototype test with a novel monolithic 20 ns time constant RC-CR shaper, capable of summing up four inputs, has also been realized, featuring...

  17. Advances in Domain Mapping of Massively Parallel Scientific Computations

    Energy Technology Data Exchange (ETDEWEB)

    Leland, Robert W. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Hendrickson, Bruce A. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

    2015-10-01

    One of the most important concerns in parallel computing is the proper distribution of workload across processors. For most scientific applications on massively parallel machines, the best approach to this distribution is to employ data parallelism; that is, to break the datastructures supporting a computation into pieces and then to assign those pieces to different processors. Collectively, these partitioning and assignment tasks comprise the domain mapping problem.
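
    In its simplest one-dimensional form, domain mapping is just splitting the index set of a data structure into near-equal contiguous blocks, one per processor. A sketch of that balanced block partition (real scientific meshes need graph partitioning, which this deliberately ignores):

```python
def block_partition(n_items, n_procs):
    # Split n_items indices into n_procs contiguous blocks whose sizes
    # differ by at most one (the classic balanced 1-D domain mapping).
    base, extra = divmod(n_items, n_procs)
    blocks, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

blocks = block_partition(10, 3)
print([list(b) for b in blocks])  # sizes 4, 3, 3 and full coverage
```

    The harder, general problem the report addresses is balancing work while also minimising the communication induced by the cut edges between blocks.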

  18. Fast Forwarding with Network Processors

    OpenAIRE

    Lefèvre, Laurent; Lemoine, E.; Pham, C; Tourancheau, B.

    2003-01-01

    Forwarding is a mechanism found in many network operations. Although a regular workstation is able to perform forwarding operations, it still suffers from poor performance when compared to dedicated hardware machines. In this paper we study the possibility of using Network Processors (NPs) to improve the capability of regular workstations to forward data. We present a simple model and an experimental study demonstrating that even though NPs are less powerful than Host Processors (HPs) they ca...

  19. Benchmarking NWP Kernels on Multi- and Many-core Processors

    Science.gov (United States)

    Michalakes, J.; Vachharajani, M.

    2008-12-01

    Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.

  20. Professional Parallel Programming with C# Master Parallel Extensions with NET 4

    CERN Document Server

    Hillar, Gastón

    2010-01-01

    Expert guidance for those programming today's dual-core processor PCs. As PC processors explode from one or two to now eight cores, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, the asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization. Teach...

  1. Customizable Memory Schemes for Data Parallel Architectures

    NARCIS (Netherlands)

    Gou, C.

    2011-01-01

    Memory system efficiency is crucial for any processor to achieve high performance, especially in the case of data parallel machines. Processing capabilities of parallel lanes will be wasted, when data requests are not accomplished in a sustainable and timely manner. Irregular vector memory accesses

  3. Dynamic Interconnection Techniques for Parallel Communication

    Directory of Open Access Journals (Sweden)

    P.K. Das Gupta

    1994-04-01

    Full Text Available Parallel computing has contributed significantly to defence applications. This field has helped in the development of dedicated hardware/software for ballistic missile defence systems. In parallel computers, the interprocessor communication mechanism strongly influences the overall performance of the system. If a parallel computer is composed of n processors and each processor has a performance p, then the maximum potential of the system is np. The present review concentrates on interprocessor communication. Crossbar and delta networks are discussed in detail. Cogent XTM, a better solution than the crossbar network, is also discussed.

  4. PVM Enhancement for Beowulf Multiple-Processor Nodes

    Science.gov (United States)

    Springer, Paul

    2006-01-01

    A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.
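    The "multiple references to a single node" mechanism can be illustrated with a small helper. `expand_hosts` is hypothetical, not part of PVM or BEOLIN; it merely sketches how repeating a node name once per processor yields one spawn slot per processor on that node:

```python
def expand_hosts(node_cpus):
    """Expand (node, cpu_count) pairs into one spawn slot per processor,
    mirroring how listing a Beowulf node multiple times lets tasks be
    spawned on each of its processors."""
    slots = []
    for node, cpus in node_cpus:
        slots.extend([node] * cpus)
    return slots
```

    For example, a two-CPU node `n0` and a four-CPU node `n1` yield six slots, so six tasks can be spawned in total.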

  5. Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

    Science.gov (United States)

    Sharma, Anuj; Manolakos, Elias S

    2015-01-01

    Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.

  6. Dynamically Reconfigurable Processor for Floating Point Arithmetic

    Directory of Open Access Journals (Sweden)

    S. Anbumani,

    2014-01-01

    Full Text Available Recently, the development of embedded processors has moved toward miniaturization and energy saving, while high performance arithmetic circuits are required in many applications in science and technology. Dynamically reconfigurable processors have been developed to meet these demands: they can change circuit configuration instantly during operation, according to instructions in the program. This paper proposes a dynamically reconfigurable circuit for floating-point arithmetic. The arithmetic circuit consists of two single precision floating-point arithmetic circuits and performs double precision floating-point arithmetic by reconfiguration. Dynamic reconfiguration changes the circuit construction within one clock cycle during operation, without stopping the circuits, enabling reconfiguration in a few nanoseconds. The proposed circuit is reconfigured in two modes: in the first mode it performs one double precision floating-point operation; otherwise it performs two parallel single precision floating-point operations. The new design reduces implementation area by reconfiguring the common parts of each operation, and it increases processing speed while requiring very few clock cycles.

  7. Parallel Markov chain Monte Carlo simulations.

    Science.gov (United States)

    Ren, Ruichao; Orkoulas, G

    2007-06-07

    With strict detailed balance, parallel Monte Carlo simulation through domain decomposition cannot be validated with conventional Markov chain theory, which describes an intrinsically serial stochastic process. In this work, the parallel version of Markov chain theory and its role in accelerating Monte Carlo simulations via cluster computing is explored. It is shown that sequential updating is the key to improving efficiency in parallel simulations through domain decomposition. A parallel scheme is proposed to reduce interprocessor communication or synchronization, which slows down parallel simulation with increasing number of processors. Parallel simulation results for the two-dimensional lattice gas model show substantial reduction of simulation time for systems of moderate and large size.
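    The sequential-updating idea can be sketched in a few lines. The sweep below is a simplified illustration, not the paper's algorithm; `sweep_sequential_domains` and the `flip_attempt` callback are my inventions. It updates a 1-D lattice in two sequential phases such that domains updated in the same phase share no boundary and could therefore run on separate processors without conflicting:

```python
def sweep_sequential_domains(spins, n_domains, beta, flip_attempt):
    """One sweep of a 1-D lattice split into equal domains. Domains of the
    same parity are boundary-disjoint, so each phase is parallelizable,
    while the even/odd phases themselves are applied sequentially (the
    ordering the paper identifies as key to efficiency)."""
    n = len(spins)
    size = n // n_domains
    for parity in (0, 1):                      # sequential phases
        for d in range(parity, n_domains, 2):  # parallelizable within a phase
            for i in range(d * size, (d + 1) * size):
                flip_attempt(spins, i, beta)   # user-supplied Monte Carlo move
    return spins
```

    Any Metropolis-style move can be plugged in as `flip_attempt`; the sweep only fixes the update order.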

  8. Simultaneous multithreaded processor enhanced for multimedia applications

    Science.gov (United States)

    Mombers, Friederich; Thomas, Michel

    1999-12-01

    The paper proposes a new media processor architecture specifically designed to handle state-of-the-art multimedia encoding and decoding tasks. To achieve this, the architecture efficiently exploits Data-, Instruction- and Thread-Level parallelism while continuously adapting its computational resources to reach the most appropriate parallelism level among all the concurrent encoding/decoding processes. Looking at the implementation constraints, several critical choices were adopted that solve the interconnection delay problem, lower the effects of cache misses and pipeline stalls, and reduce register file and memory size, by adopting a clustered Simultaneous Multithreaded Architecture. We enhanced the classic model to exploit both Instruction and Data Level Parallelism through vector instructions. The vector extension is well justified for multimedia workloads and improves code density, crossbar complexity, register file ports and decoding logic area, while it still provides an efficient way to fully exploit a large set of functional units. An MPEG-2 encoding algorithm based on Hybrid Genetic search has been implemented that shows the efficiency of the architecture in adapting its resource allocation to better fulfill the application requirements.

  9. A Parallel Encryption Algorithm Based on Piecewise Linear Chaotic Map

    Directory of Open Access Journals (Sweden)

    Xizhong Wang

    2013-01-01

    Full Text Available We introduce a parallel chaos-based encryption algorithm that takes advantage of multicore processors. The chaotic cryptosystem is generated by the piecewise linear chaotic map (PWLCM). The parallel algorithm is designed with a master/slave communication model using the Message Passing Interface (MPI). The algorithm is suitable not only for multicore processors but also for single-processor architectures. The experimental results show that the chaos-based cryptosystem possesses good statistical properties. The parallel algorithm provides much better performance than serial ones and is useful for encrypting/decrypting large files or multimedia.
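    A minimal serial sketch of such a PWLCM-based stream cipher follows. This is my own illustration; the paper's actual keystream derivation and its parallel master/slave structure are not reproduced here:

```python
def pwlcm(x, p):
    """Piecewise linear chaotic map on [0, 1] with control parameter
    p in (0, 0.5); the map is symmetric about x = 0.5."""
    if x > 0.5:
        x = 1.0 - x
    if x < p:
        return x / p
    return (x - p) / (0.5 - p)

def keystream(seed, p, n, discard=100):
    """Derive n key bytes by iterating the map (discarding transients)."""
    x = seed
    for _ in range(discard):
        x = pwlcm(x, p)
    out = []
    for _ in range(n):
        x = pwlcm(x, p)
        out.append(int(x * 256) % 256)
    return bytes(out)

def crypt(data, seed, p):
    """XOR stream cipher: the same call encrypts and decrypts."""
    ks = keystream(seed, p, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

    The map parameters (seed, p) act as the secret key; because the cipher is a plain XOR, applying `crypt` twice with the same key recovers the plaintext.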

  10. Scalable Parallel Algebraic Multigrid Solvers

    Energy Technology Data Exchange (ETDEWEB)

    Bank, R; Lu, S; Tong, C; Vassilevski, P

    2005-03-23

    The authors propose a parallel algebraic multilevel algorithm (AMG), which has the novel feature that the subproblem residing in each processor is defined over the entire partition domain, although the vast majority of unknowns for each subproblem are associated with the partition owned by the corresponding processor. This feature ensures that a global coarse description of the problem is contained within each of the subproblems. The advantages of this approach are that interprocessor communication is minimized in the solution process while an optimal order of convergence rate is preserved; and the speed of local subproblem solvers can be maximized using the best existing sequential algebraic solvers.

  11. Parallel implementation of an algorithm for Delaunay triangulation

    Science.gov (United States)

    Merriam, Marshal L.

    1992-01-01

    The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

  12. Implementation Of A Prototype Digital Optical Cellular Image Processor (DOCIP)

    Science.gov (United States)

    Huang, K. S.; Sawchuk, A. A.; Jenkins, B. K.; Chavel, P.; Wang, J. M.; Weber, A. G.; Wang, C. H.; Glaser, I.

    1989-02-01

    A processing element of a prototype digital optical cellular image processor (DOCIP) is implemented to demonstrate a particular parallel computing and interconnection architecture. This experimental digital optical computing system consists of a 2-D array of 54 optical logic gates, a 2-D array of 53 subholograms to provide interconnections between gates, and electronic input/output interfaces. The multi-facet interconnection hologram used in this system is fabricated by a computer-controlled optical system to offer very flexible interconnections.

  13. Hierarchical Parallel Evaluation of a Hamming Code

    Directory of Open Access Journals (Sweden)

    Shmuel T. Klein

    2017-04-01

    Full Text Available The Hamming code is a well-known error correction code and can correct a single error in an input vector of size n bits by adding log n parity checks. A new parallel implementation of the code is presented, using a hierarchical structure of n processors in log n layers. All the processors perform similar simple tasks and need only a few bytes of internal memory.
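    The parity-check structure the abstract describes can be sketched serially as follows. This is a standard textbook Hamming encoder/corrector with parity bits at power-of-two positions; the hierarchical n-processor layout itself is not modeled:

```python
def hamming_encode(data_bits):
    """Insert parity bits at power-of-two positions (1-indexed) so that
    each parity bit covers the positions sharing its index bit."""
    n_data = len(data_bits)
    r = 0
    while (1 << r) < n_data + r + 1:
        r += 1
    code = [0] * (n_data + r)
    j = 0
    for i in range(1, len(code) + 1):
        if i & (i - 1):              # not a power of two -> data position
            code[i - 1] = data_bits[j]
            j += 1
    for k in range(r):
        p = 1 << k
        parity = 0
        for i in range(1, len(code) + 1):
            if i & p and i != p:
                parity ^= code[i - 1]
        code[p - 1] = parity
    return code

def hamming_correct(code):
    """Syndrome = XOR of the 1-indexed positions holding a 1; a nonzero
    syndrome names the single flipped bit."""
    syndrome = 0
    for i, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= i
    fixed = list(code)
    if syndrome:
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome
```

    The syndrome computation is a tree of XORs, which is exactly the part the paper distributes over n processors in log n layers.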

  14. On-board neural processor design for intelligent multisensor microspacecraft

    Science.gov (United States)

    Fang, Wai-Chi; Sheu, Bing J.; Wall, James

    1996-03-01

    A compact VLSI neural processor based on the Optimization Cellular Neural Network (OCNN) has been under development to provide a wide range of support for an intelligent remote sensing microspacecraft, which requires both high bandwidth communication and high-performance computing for on-board data analysis, thematic data reduction, synergy of multiple types of sensors, and other advanced smart-sensor functions. The OCNN is developed with emphasis on its capability to find globally optimal solutions by using a hardware annealing method. The hardware annealing function is embedded in the network; it is a parallel version of fast mean-field annealing in analog networks and is highly efficient in finding globally optimal solutions for cellular neural networks. The OCNN is designed to perform programmable functions for fine-grained processing with annealing control to enhance output quality. The OCNN architecture is a programmable multi-dimensional array of neurons, each locally connected to its neighboring neurons. Major design features of the OCNN neural processor include massively parallel neural processing, hardware annealing capability, a winner-take-all mechanism, digitally programmable synaptic weights, and a multisensor parallel interface. The feasibility of a compact current-mode VLSI design of the OCNN neural processor is demonstrated by a prototype 5 x 5 neuroprocessor array chip in a 2-micrometer CMOS technology. The OCNN operation theory, architecture, design and implementation, prototype chip, and system applications have been investigated in detail and are presented in this paper.

  15. Using Graphics Processors for Parallelizing Hash-based Data Carving

    CERN Document Server

    Collange, Sylvain; Daumas, Marc; Defour, David

    2009-01-01

    The ability to detect fragments of deleted image files and to reconstruct these image files from all available fragments on disk is a key activity in the field of digital forensics. Although reconstruction of image files from the file fragments on disk can be accomplished by simply comparing the content of sectors on disk with the content of known files, this brute-force approach can be time consuming. This paper presents results from research into the use of Graphics Processing Units (GPUs) in detecting specific image file byte patterns in disk clusters. A unique identifying pattern for each disk sector is compared against patterns in known images. A pattern match indicates the potential presence of an image and flags the disk sector for further in-depth examination to confirm the match. The GPU-based implementation outperforms the software implementation by a significant margin.
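    The sector-matching idea can be sketched on the CPU. The hash choice, sector size, and function names below are my assumptions; the paper's GPU kernel is not reproduced:

```python
import hashlib

SECTOR = 512  # assumed sector size in bytes

def block_hashes(known_file_bytes):
    """Hash every SECTOR-sized block of a known image file, yielding the
    set of identifying patterns to search for."""
    return {
        hashlib.sha1(known_file_bytes[i:i + SECTOR]).digest()
        for i in range(0, len(known_file_bytes) - SECTOR + 1, SECTOR)
    }

def carve_candidates(disk_image, known_hashes):
    """Flag the offsets of disk sectors whose hash matches a block of a
    known file; each hit warrants further in-depth examination."""
    hits = []
    for off in range(0, len(disk_image) - SECTOR + 1, SECTOR):
        if hashlib.sha1(disk_image[off:off + SECTOR]).digest() in known_hashes:
            hits.append(off)
    return hits
```

    Each sector's comparison is independent of the others, which is what makes the workload map naturally onto GPU threads.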

  16. Approximate inference for spatial functional data on massively parallel processors

    DEFF Research Database (Denmark)

    Raket, Lars Lau; Markussen, Bo

    2014-01-01

    With continually increasing data sizes, the relevance of the big n problem of classical likelihood approaches is greater than ever. The functional mixed-effects model is a well established class of models for analyzing functional data. Spatial functional data in a mixed-effects setting...... in linear time. An extremely efficient GPU implementation is presented, and the proposed methods are illustrated by conducting a classical statistical analysis of 2D chromatography data consisting of more than 140 million spatially correlated observation points....

  17. JESPP: Joint Experimentation on Scalable Parallel Processors Supercomputers

    Science.gov (United States)

    2010-03-01

    for traceability of individual programmers during the tumult of operations in a simulation bay, where many operators will need to log in, use... of which remains to be apprehended. The author's experience in teaching an introductory course on Data Mining at the Viterbi School of Engineering

  18. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira

    2015-01-01

    The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation o...

  19. Optimal job splitting in parallel processor sharing queues

    NARCIS (Netherlands)

    G.J. Hoekstra (Gerard); R.D. van der Mei (Rob); S. Bhulai (Sandjai)

    2012-01-01

    The main barrier to the sustained growth of wireless communications is the Shannon limit that applies to the channel capacity. A promising means to realize high-capacity enhancements is the use of multi-path communication solutions to improve reliability and network performance in

  20. High Volume Colour Image Processing with Massively Parallel Embedded Processors

    NARCIS (Netherlands)

    Jacobs, Jan W.M.; Bond, W.; Pouls, R.; Smit, Gerard J.M.; Joubert, G.R.; Peters, F.J.; Tirado, P.; Nagel, W.E.; Plata, O.; Zapata, E.

    2006-01-01

    Currently Oce uses FPGA technology to implement colour image processing for their high volume colour printers. Although FPGA technology provides sufficient performance, it has a rather tedious development process. This paper describes the research conducted on an alternative implementation

  1. Processing modes and parallel processors in producing familiar keying sequences

    NARCIS (Netherlands)

    Verwey, Willem B.

    2003-01-01

    Recent theorizing indicates that the acquisition of movement sequence skill involves the development of several independent sequence representations at the same time. To examine this for the discrete sequence production task, participants in Experiment 1 produced a highly practiced sequence of six k

  2. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira;

    2015-01-01

    , which is used to minimize dominance tests while maintaining high throughput. The algorithm uses an efficiently-updatable data structure over the shared, global skyline, based on point-based partitioning. Also, we release a large benchmark of optimized skyline algorithms, with which we demonstrate...

  3. Design and Implementation of 64-Bit Execute Stage for VLIW Processor Architecture on FPGA

    Directory of Open Access Journals (Sweden)

    Manju Rani

    2012-07-01

    Full Text Available An FPGA implementation of a 64-bit execute unit for a VLIW processor, with improved power figures, is presented in this paper. VHDL is used to model the architecture. VLIW stands for Very Long Instruction Word; this processor architecture is based on parallel processing, in which more than one instruction is executed in parallel, and is used to increase instruction throughput. It is thus the basis of modern superscalar processors. VLIW is essentially a RISC processor, the difference being that it uses a longer instruction word. The execute stage of the pipeline executes the instruction and is where the ALU (arithmetic logic unit) is located. The execute stage was synthesized and targeted for a Xilinx Virtex 4 FPGA, and the results calculated for the 64-bit execute stage show improved power compared to previous work.

  4. Libera Electron Beam Position Processor

    CERN Document Server

    Ursic, Rok

    2005-01-01

    Libera is a product family delivering unprecedented possibilities for either building powerful single station solutions or architecting complex feedback systems in the field of accelerator instrumentation and controls. This paper presents the functionality and field performance of its first member, the electron beam position processor. It offers superior performance, with multiple measurement channels simultaneously delivering position measurements in digital format with MHz, kHz, and Hz bandwidths. This all-in-one product, facilitating pulsed and CW measurements, is much more than simply a high performance beam position measuring device delivering micrometer level reproducibility with sub-micrometer resolution. Rich connectivity options and innate processing power make it a powerful feedback building block. By interconnecting multiple Libera electron beam position processors one can build a low-latency high throughput orbit feedback system without adding additional hardware. Libera electron beam position processor ...

  5. Java Processor Optimized for RTSJ

    Directory of Open Access Journals (Sweden)

    Tu Shiliang

    2007-01-01

    Full Text Available Due to the preeminent work on the real-time specification for Java (RTSJ), Java is increasingly expected to become the leading programming language for real-time systems. To provide a Java platform suitable for real-time applications, a Java processor which can directly execute Java bytecode is proposed in this paper. It provides efficient hardware support for some mechanisms specified in the RTSJ and offers a simpler programming model by ameliorating the RTSJ's scoped memory. The worst case execution time (WCET) of the bytecodes implemented in this processor is predictable by employing the optimization method proposed in our previous work, in which all processing that interferes with predictability is handled before bytecode execution. A further advantage of this method is that it makes the implementation of the processor simpler and suited to a low-cost FPGA chip.

  6. Making CSB + -Trees Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....

  7. Reconfigurable optical switches with monolithic electrical-to-optical interfaces

    Energy Technology Data Exchange (ETDEWEB)

    Cheng, J.; Zhou, P. [New Mexico Univ., Albuquerque, NM (United States). Center for High Technology Materials; Zolper, J.C.; Lear, K.L.; Vawter, G.A. [Sandia National Labs., Albuquerque, NM (United States); Leibenguth, R.E.; Adams, A.C. [AT and T Bell Labs., Breinigsville, PA (United States)

    1994-03-01

    Vertical cavity surface-emitting lasers (VCSELs) can be integrated with heterojunction phototransistors (HPTs) and heterojunction bipolar transistors (HBTs) on the same wafer to form high speed optical and optoelectronic switches, respectively, that can be optically or electrically addressed. This permits the direct communication and transmission of data between distributed electronic processors through an optical switching network. The experimental demonstration of an integrated optoelectronic HBT/VCSEL switch combining a GaAs/AlGaAs heterojunction bipolar transistor (HBT) with a VCSEL is described below, using the same epilayer structure upon which binary HPT/VCSEL optical switches are also built. The monolithic HBT/VCSEL switch has high current gain, low power dissipation, and a high optical to electrical conversion efficiency. Its modulation has been measured and modeled.

  8. In situ Fabrication of Monolithic Copper Azide

    Science.gov (United States)

    Li, Bing; Li, Mingyu; Zeng, Qingxuan; Wu, Xingyu

    2016-04-01

    Fabrication and characterization of monolithic copper azide were performed. The monolithic nanoporous copper (NPC) with interconnected pores and nanoparticles was prepared by decomposition and sintering of the ultrafine copper oxalate. The preferable monolithic NPC can be obtained through decomposition and sintering at 400°C for 30 min. Then, the available monolithic NPC was in situ reacted with the gaseous HN3 for 24 h and the monolithic NPC was transformed into monolithic copper azide. Additionally, the copper particles prepared by electrodeposition were also reacted with the gaseous HN3 under uniform conditions as a comparison. The fabricated monolithic copper azide was characterized by Fourier transform infrared (FTIR), inductively coupled plasma-optical emission spectrometry (ICP-OES), and differential scanning calorimetry (DSC).

  9. Parallel Feature Extraction System

    Institute of Scientific and Technical Information of China (English)

    MA Huimin; WANG Yan

    2003-01-01

    Very high speed image processing is needed in some applications, especially for weapons. In this paper, a high speed image feature extraction system with a parallel structure was implemented with a Complex Programmable Logic Device (CPLD); it can perform image feature extraction in several microseconds, almost with no delay. The system design is presented through an application instance of a flying plane, whose infrared image includes two kinds of features: geometric shape features in the binary image and temperature features in the gray image. Feature extraction is accordingly performed on both kinds of features. Edge and area are the two most important features of an image. Angles often occur where different parts of the target's image connect, indicating that one area ends and another begins. These three key features can form the whole presentation of an image, so this parallel feature extraction system includes three processing modules: edge extraction, angle extraction and area extraction. The parallel structure is realized by a group of processors: every detector is followed by one processor, every route has the same circuit form, and all work together, controlled by a common clock, to realize feature extraction. The extraction system has a simple structure, small volume, high speed, and good stability against noise. It can be used in battlefield recognition systems.

  10. Parallel computing, failure recovery, and extreme values

    DEFF Research Database (Denmark)

    Andersen, Lars Nørvang; Asmussen, Søren

    A task of random size T is split into M subtasks of lengths T1, . . . , TM, each of which is sent to one out of M parallel processors. Each processor may fail at a random time before completing its allocated task, and then has to restart it from the beginning. If X1, . . . ,XM are the total task...... times at the M processors, the overall total task time is then ZM = max1,...,MXi. Limit theorems as M → ∞ are given for ZM, allowing the distribution of T to depend on M. In some cases the limits are classical extreme value distributions, in others they are of a different type....
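    The model can be simulated directly. The sketch below is my illustration; exponential failure times are an assumption, since the abstract leaves the failure distribution general. It computes a per-processor total task time X_i with restarts and the overall time Z_M = max X_i:

```python
import random

def total_task_time(t, failure_rate, rng):
    """Time to finish a subtask of length t when the processor fails at
    exponential rate and must restart the subtask from the beginning."""
    elapsed = 0.0
    while True:
        fail = rng.expovariate(failure_rate)
        if fail >= t:              # uninterrupted run long enough to finish
            return elapsed + t
        elapsed += fail            # failed partway through; restart

def overall_time(task_lengths, failure_rate, rng):
    """Z_M = max over the M processors of the total task times."""
    return max(total_task_time(t, failure_rate, rng) for t in task_lengths)
```

    Repeating `overall_time` over many replications and letting M grow would give an empirical view of the extreme-value limits the paper derives.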

  11. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    Science.gov (United States)

    Chatterjee, Siddhartha; Gunnels, John A.

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
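    One plausible minimal version of a skewed cyclic owner map, an illustration of the idea only and not the patented scheme's exact formula, is:

```python
def skewed_cyclic_owner(i, j, P, Q):
    """Owner (p, q) of element (i, j) of a 2-D array on a P x Q processor
    mesh: plain cyclic distribution in the column dimension, with the row
    mapping skewed by the column coordinate so that a single row of the
    array spreads over processors in both mesh dimensions."""
    q = j % Q
    p = (i + j // Q) % P   # skew: shift the row mapping as j advances
    return p, q
```

    With this map, walking along one array row already touches every processor of a 2 x 2 mesh, rather than staying in a single processor row as an unskewed cyclic map would.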

  12. Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Shchur, L. N.; Andreichenko, V. B.; Dotsenko, Vl. S.

    We describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.
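    The Wolff algorithm realized by the processor can be sketched in software. This is a textbook serial version for the 2-D Ising model; the SPP's hardware specifics are not modeled:

```python
import math
import random

def wolff_step(spins, L, beta, rng):
    """One Wolff cluster flip on an L x L Ising lattice (spins is a dict
    mapping (x, y) to +/-1) with periodic boundaries; same-sign bonds are
    added to the cluster with probability 1 - exp(-2*beta)."""
    p_add = 1.0 - math.exp(-2.0 * beta)
    seed = (rng.randrange(L), rng.randrange(L))
    s0 = spins[seed]
    cluster, stack = {seed}, [seed]
    while stack:
        x, y = stack.pop()
        for n in (((x + 1) % L, y), ((x - 1) % L, y),
                  (x, (y + 1) % L), (x, (y - 1) % L)):
            if n not in cluster and spins[n] == s0 and rng.random() < p_add:
                cluster.add(n)
                stack.append(n)
    for site in cluster:           # flip the whole cluster at once
        spins[site] = -s0
    return len(cluster)
```

    Flipping entire clusters rather than single spins is what lets the algorithm, in software or in the SPP's hardware, probe critical behaviour of million-spin systems without critical slowing down.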

  13. Cluster algorithm special purpose processor

    Energy Technology Data Exchange (ETDEWEB)

    Talapov, A.L.; Shchur, L.N.; Andreichenko, V.B.; Dotsenko, V.S. (Landau Inst. for Theoretical Physics, GSP-1 117940 Moscow V-334 (USSR))

    1992-08-10

    In this paper, the authors describe a Special Purpose Processor, realizing the Wolff algorithm in hardware, which is fast enough to study the critical behaviour of 2D Ising-like systems containing more than one million spins. The processor has been checked to produce correct results for a pure Ising model and for Ising model with random bonds. Its data also agree with the Nishimori exact results for spin glass. Only minor changes of the SPP design are necessary to increase the dimensionality and to take into account more complex systems such as Potts models.

  14. High performance deformable image registration algorithms for manycore processors

    CERN Document Server

    Shackleford, James; Sharp, Gregory

    2013-01-01

    High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step in being able to compare or fuse data obtained from different sensor measurements.

  15. Reconfigurable Communication Processor:A New Approach for Network Processor

    Institute of Scientific and Technical Information of China (English)

    孙华; 陈青山; 张文渊

    2003-01-01

    As the traditional RISC + ASIC/ASSP approach to network processor design cannot meet today's requirements, this paper describes an alternative approach, a Reconfigurable Processing Architecture, which boosts performance to ASIC level while preserving the programmability of the traditional RISC based system. The paper covers both the hardware architecture and the software development environment architecture.

  16. Parallel Algorithm Design on Some Distributed Systems

    Institute of Scientific and Technical Information of China (English)

    孙家昶; 张林波; et al.

    1997-01-01

    Some testing results on DAWNING-1000, Paragon and a workstation cluster are described in this paper. On the home-made parallel system DAWNING-1000 with 32 computational processors, practical performance of 1.1777 Gflops and 1.58 Gflops has been measured in solving a dense linear system and in matrix multiplication, respectively. The scalability is also investigated. The importance of designing efficient parallel algorithms for evaluating parallel systems is emphasized.

  17. Parallel network simulations with NEURON.

    Science.gov (United States)

    Migliore, M; Cannia, C; Lytton, W W; Markram, Henry; Hines, M L

    2006-10-01

    The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2,000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored.

  18. Track finding processor in the DTBX based CMS barrel muon trigger

    CERN Document Server

    Kluge, A

    1996-01-01

    We present the design and simulation of the track finding processor in the DTBX ( Drift Tube with Bunch Crossing Identification) based CMS barrel muon trigger system. The processor searches for muon tracks originating from the interaction region by joining the track segments provided by the mean timer processors of the drift chambers to track strings. It assigns transverse momenta to the reconstructed tracks using the tracks' bending angle. High speed is achieved by performing the track reconstruction fully in parallel. In this contribution the algorithms, implementation and simulation results are presented.

  19. Optimal processor allocation for sort-last compositing under BSP-tree ordering

    Science.gov (United States)

    Ramakrishnan, C. R.; Silva, Claudio T.

    1999-03-01

    In this paper, we consider a parallel rendering model that exploits the fundamental distinction between rendering and compositing operations, by assigning processors from specialized pools for each of these operations. Our motivation is to support the parallelization of general scan-line rendering algorithms with minimal effort, basically by supporting a compositing back-end (i.e., a sort-last architecture) that is able to perform user-controlled image composition. Our computational model is based on organizing rendering as well as compositing processors on a BSP-tree, whose internal nodes we call the compositing tree. Many known rendering algorithms, such as volumetric ray casting and polygon rendering can be easily parallelized based on the structure of the BSP-tree. In such a framework, it is paramount to minimize the processing power devoted to compositing, by minimizing the number of processors allocated for composition as well as optimizing the individual compositing operations. In this paper, we address the problems related to the static allocation of processor resources to the compositing tree. In particular, we present an optimal algorithm to allocate compositing operations to compositing processors. We also present techniques to evaluate the compositing operations within each processor using minimum memory while promoting concurrency between computation and communication. We describe the implementation details and provide experimental evidence of the validity of our techniques in practice.
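    The compositing-tree evaluation described above can be illustrated with a minimal sketch. This assumes premultiplied-alpha pixels and the standard Porter-Duff "over" operator; the tree encoding (internal node = front/back pair, leaf = rendered pixel) is an illustrative simplification of a BSP-tree, not the paper's data structure.

```python
# Back-to-front "over" compositing along a BSP-tree, as in a sort-last back
# end: leaves hold pixels rendered by rendering processors, internal nodes
# order their children front/back with respect to the viewpoint.

def over(front, back):
    """Porter-Duff 'over' on premultiplied (r, g, b, a) pixels."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    k = 1.0 - fa
    return (fr + k * br, fg + k * bg, fb + k * bb, fa + k * ba)

def composite(node):
    """Recursively evaluate the compositing tree: an internal node is a
    (front_child, back_child) pair; a leaf is a 4-tuple pixel."""
    if isinstance(node, tuple) and len(node) == 2:   # internal node
        return over(composite(node[0]), composite(node[1]))
    return node                                      # leaf pixel

# An opaque red leaf in front completely hides the blue leaf behind it.
tree = ((1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 1.0, 1.0))
print(composite(tree))  # prints (1.0, 0.0, 0.0, 1.0)
```

    Because "over" is associative but not commutative, the BSP-tree fixes the evaluation order; the allocation problem in the paper is deciding which compositing processor evaluates which internal nodes.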

  20. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    Science.gov (United States)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the development of international wireless communication standards, the computational requirements for baseband signal processors are increasing. Time-to-market pressure makes it impossible to completely redesign new processors for each evolving standard. Owing to their high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA designs. In addition, the large amount of parallel data processed in computation-intensive functions fosters the development of single instruction multiple data (SIMD) architectures on SDR platforms, so an efficient way to prototype SDR processors is needed. In this paper we present a bit- and cycle-accurate model of programmable SIMD SDR processors in the machine description language LISA, an instruction set architecture description language that enables rapid modeling at the architectural level. To evaluate the proposed processor, three common baseband functions, FFT, FIR digital filtering and matrix multiplication, were mapped onto the SDR platform. Analytical results showed that the SDR processor achieved a performance boost of up to 47.1% relative to the reference processor.

  1. ASSP Advanced Sensor Signal Processor.

    Science.gov (United States)

    1984-06-01

    transfer data and commands. When a processor receives the required data (image) and/or command, that data will be operated on autonomously. The ... RAM is provided by two separately controlled DMA address generator chips (Am2940). Each of these DMA chips creates an 8-bit address. One DMA chip gene...

  2. Parallelization of Edge Detection Algorithm using MPI on Beowulf Cluster

    Science.gov (United States)

    Haron, Nazleeni; Amir, Ruzaini; Aziz, Izzatdin A.; Jung, Low Tan; Shukri, Siti Rohkmah

    In this paper, we present the design of a parallel Sobel edge detection algorithm using Foster's methodology. The parallel algorithm is implemented using the MPI message-passing library and a master/slave algorithm. Every processor performs the same sequential algorithm but on a different part of the image. Experimental results conducted on a Beowulf cluster are presented to demonstrate the performance of the parallel algorithm.
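    The strip decomposition behind such a master/slave scheme can be sketched without MPI: split the image into horizontal strips (one per worker), give each strip one ghost row on each side so the 3x3 stencil has its neighbors, filter each strip independently, and reassemble. This is a pure-Python stand-in under assumed strip boundaries, not the paper's code.

```python
# Domain decomposition for a parallel Sobel filter: each "worker" loop
# iteration plays the role of one MPI slave processing its strip.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal Sobel kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical Sobel kernel

def sobel(img):
    """Sequential reference: |Gx| + |Gy| on interior pixels, zero border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * img[y-1+j][x-1+i] for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * img[y-1+j][x-1+i] for j in range(3) for i in range(3))
            out[y][x] = abs(gx) + abs(gy)
    return out

def parallel_sobel(img, workers):
    """Each worker filters its strip plus one ghost row above and below."""
    h = len(img)
    out = [[0] * len(img[0]) for _ in range(h)]
    bounds = [(r * h // workers, (r + 1) * h // workers) for r in range(workers)]
    for a, b in bounds:                       # conceptually concurrent slaves
        lo, hi = max(a - 1, 0), min(b + 1, h) # strip with ghost rows
        strip = sobel(img[lo:hi])
        for y in range(max(a, 1), min(b, h - 1)):
            out[y] = strip[y - lo]            # master gathers valid rows
    return out

img = [[(x * y) % 7 for x in range(5)] for y in range(6)]
print(parallel_sobel(img, 2) == sobel(img))   # prints True
```

    The ghost rows are the only data a worker needs beyond its own strip, which is what keeps the MPI communication volume low relative to the computation.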

  3. A reconfigurable optoelectronic interconnect technology for multi-processor networks

    Energy Technology Data Exchange (ETDEWEB)

    Lu, Y.C.; Cheng, J. [Univ. of New Mexico, Albuquerque, NM (United States). Center for High Technology Materials; Zolper, J.C.; Klem, J. [Sandia National Labs., Albuquerque, NM (United States)

    1995-05-01

    This paper describes a new optical interconnect architecture and the integrated optoelectronic circuit technology for implementing a parallel, reconfigurable, multiprocessor network. The technology consists of monolithic arrays of optoelectronic switches that integrate vertical-cavity surface-emitting lasers with three-terminal heterojunction phototransistors, which effectively combine the functions of an optical transceiver and an optical spatial routing switch. These switches have demonstrated optical switching at 200 Mb/s, and electrical-to-optical data conversion at > 500 Mb/s, with a small-signal electrical-to-optical modulation bandwidth of approximately 4 GHz.

  4. Real-Time Signal Processor for Pulsar Studies

    Indian Academy of Sciences (India)

    P. S. Ramkumar; A. A. Deshpande

    2001-12-01

    This paper describes the design, tests and preliminary results of a real-time parallel signal processor built to aid a wide variety of pulsar observations. The signal processor reduces the distortions caused by the effects of dispersion, Faraday rotation, doppler acceleration and parallactic angle variations, at a sustained data rate of 32 Msamples/sec. It also folds the pulses coherently over the period and integrates adjacent samples in time and frequency to enhance the signal-to-noise ratio. The resulting data are recorded for further off-line analysis of the characteristics of pulsars and the intervening medium. The signal processing for analysis of pulsar signals is quite complex, imposing the need for a high computational throughput, typically of the order of a giga-operation per second (GOPS). Conventionally, the high computational demand restricts the flexibility to handle only a few types of pulsar observations. This instrument is designed to handle a wide variety of pulsar observations with the Giant Metre Wave Radio Telescope (GMRT), and is flexible enough to be used in many other high-speed signal processing applications. The technology used includes field-programmable gate array (FPGA) based data/code routing interfaces, PC-AT based control, diagnostics and data acquisition, digital signal processor (DSP) chip based parallel processing nodes and C language based control software and DSP-assembly programs for signal processing. The architecture and the software implementation of the parallel processor are fine-tuned to realize about 60 MOPS per DSP node and a multiple-instruction-multiple-data (MIMD) capability.
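    The pulse-folding step mentioned above can be sketched in a few lines: samples are binned by rotational phase (time modulo the pulse period) and averaged, so the periodic pulse accumulates while noise averages down. This is a simplified single-channel illustration with invented numbers, not the instrument's algorithm.

```python
# Folding a pulsar time series: phase = (t mod P) / P selects a profile bin.

def fold(samples, dt, period, nbins):
    """Fold a time series with sample spacing dt over the given period."""
    acc = [0.0] * nbins
    cnt = [0] * nbins
    for i, s in enumerate(samples):
        phase = (i * dt) % period / period       # phase in [0, 1)
        b = int(phase * nbins)
        acc[b] += s
        cnt[b] += 1
    return [a / c if c else 0.0 for a, c in zip(acc, cnt)]

# A noiseless pulse train: one pulse per 2.0 s period, sampled at dt = 0.25 s.
samples = [1.0 if i % 8 == 0 else 0.0 for i in range(80)]
profile = fold(samples, dt=0.25, period=2.0, nbins=8)
print(profile[0], max(profile[1:]))   # prints 1.0 0.0
```

    With n folded pulses and uncorrelated noise, the signal-to-noise ratio of the profile grows as sqrt(n), which is why long integrations are folded rather than averaged blindly.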

  5. Cassava processors' awareness of occupational and environmental ...

    African Journals Online (AJOL)

    Cassava processors' awareness of occupational and environmental hazards ... Majority of the respondents also complained of lack of water (78.4%), lack of ... so as to reduce the problems faced by cassava processors during processing.

  6. PDDP, A Data Parallel Programming Model

    Directory of Open Access Journals (Sweden)

    Karen H. Warren

    1996-01-01

    PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

  7. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput on FPGA-based processor cores: first, superpipelining enables higher-frequency system clocks, and second, predicated instructions circumvent costly pipeline stalls due to branches. To evaluate their effects, we develop Tinuso, a processor architecture optimized for FPGA implementation. We demonstrate ...

  8. Simulation of Particulate Flows Multi-Processor Machines with Distributed Memory

    Energy Technology Data Exchange (ETDEWEB)

    Uhlmann, M.

    2004-07-01

    We present a method for the parallelization of an immersed boundary algorithm for particulate flows using the MPI communication standard. The treatment of the fluid phase uses the domain decomposition technique over a Cartesian processor grid. The solution of the Helmholtz problem is approximately factorized and relies upon a parallel tri-diagonal solver; the Poisson problem is solved by means of a parallel multi-grid technique similar to MUDPACK. For the solid phase we employ a master-slave technique where one processor handles all the particles contained in its Eulerian fluid sub-domain and zero or more neighboring processors collaborate in the computation of particle-related quantities whenever a particle position overlaps the boundary of a sub-domain. The parallel efficiency of some preliminary computations is presented. (Author) 9 refs.

  9. Platforms for artificial neural networks : neurosimulators and performance prediction of MIMD-parallel systems

    NARCIS (Netherlands)

    Vuurpijl, L.G.

    1998-01-01

    In this thesis, two platforms for simulating artificial neural networks are discussed: MIMD-parallel processor systems as an execution platform and neurosimulators as a research and development platform. Because of the parallelism encountered in neural networks, distributed processor systems seem to

  11. Parallel contingency statistics with Titan.

    Energy Technology Data Exchange (ETDEWEB)

    Thompson, David C.; Pebay, Philippe Pierre

    2009-09-01

    This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09], which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of these new parallel engines is illustrated by means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevents this new engine from exhibiting the optimal parallel speed-up that the aforementioned engines do. This report therefore discusses the design trade-offs made and studies performance with up to 200 processors.
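    The scalability limitation mentioned above can be made concrete with a sketch. Unlike descriptive statistics, whose parallel reduction combines fixed-size moments, a contingency-statistics reduction must merge tables whose size grows with the number of distinct category pairs. The sketch below uses Python's `Counter` as a stand-in for a local table; it is not the VTK/Titan API.

```python
# Map-reduce shape of a parallel contingency engine: each processor builds a
# local table of (x, y) category-pair counts, then tables are merged by
# key union. The merge cost scales with table size, capping speed-up.

from collections import Counter

def local_table(pairs):
    """Per-processor step: count (x, y) category pairs in the local data."""
    return Counter(pairs)

def merge(tables):
    """Reduction step: union of keys, sum of counts."""
    total = Counter()
    for t in tables:
        total.update(t)
    return total

# Two "processors" each count their share of the observations.
t1 = local_table([("a", 0), ("a", 1), ("b", 0)])
t2 = local_table([("a", 0), ("b", 1)])
merged = merge([t1, t2])
print(merged[("a", 0)])   # prints 2
```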

  12. 40 CFR 791.45 - Processors.

    Science.gov (United States)

    2010-07-01

    ... 40 Protection of Environment 31 2010-07-01 2010-07-01 true Processors. 791.45 Section 791.45 Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) TOXIC SUBSTANCES CONTROL ACT (CONTINUED) DATA REIMBURSEMENT Basis for Proposed Order § 791.45 Processors. (a) Generally, processors will be...

  13. Monolithic Fuel Fabrication Process Development

    Energy Technology Data Exchange (ETDEWEB)

    C. R. Clark; N. P. Hallinan; J. F. Jue; D. D. Keiser; J. M. Wight

    2006-05-01

    The pursuit of a high uranium density research reactor fuel plate has led to monolithic fuel, which possesses the greatest possible uranium density in the fuel region. Fabrication process developments include improvements to friction stir welding tool geometry and cooling, and a reduction in the length of time required to complete the transient liquid phase bonding process. Annealing effects on the microstructures of the U-10Mo foil and friction stir welded aluminum 6061 cladding are also examined.

  14. Parallel FFT Algorithm on Computer Clusters

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    The DFT is widely applied in signal processing and other fields. Most existing fast methods are either based on parallel computers connected by particular networks such as the butterfly network or hypercube, or based on the assumptions of instant transmission, conflict-free communication, complete connection of the parallel processors and an unlimited number of usable processors. However, the communication delay in the interconnection system cannot be ignored. This paper works on the following aspects: instant transmission, the dispatching of tasks, and the path of information through the communication links in computer cluster systems; and the layout of a dynamic FFT algorithm under different computer cluster structures.
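    The butterfly data flow that gives the butterfly network its relevance to the FFT can be shown in a short radix-2 sketch: at each stage, pairs of values whose indices differ in one bit are combined, which is exactly the exchange pattern such networks provide. A serial illustration, not a cluster algorithm:

```python
# Recursive radix-2 Cooley-Tukey DFT: each level of recursion corresponds to
# one butterfly stage that a parallel layout must map onto the interconnect.

import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return x
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + t                             # butterfly: sum
        out[k + n // 2] = even[k] - t                     # butterfly: difference
    return out

# The DFT of a constant signal concentrates all energy in bin 0.
X = fft([1, 1, 1, 1])
print(abs(X[0]))   # prints 4.0
```

    On a cluster, each butterfly stage becomes a communication round, which is why the paper's concern with link delay and task dispatch matters more than the O(N log N) arithmetic count.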

  15. Parallel solutions of correlation dimension calculation

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    The calculation of the correlation dimension is a key problem in the study of fractals. The standard algorithm requires O(N^2) computations. Previous improved methods endeavor to sequentially reduce redundant computation when there are many different-dimensional phase spaces, but their applicability and degree of performance improvement are limited. This paper presents two fast parallel algorithms: an O(N^2/p + log p)-time, p-processor PRAM algorithm and an O(N^2/p)-time, p-processor LARPBS algorithm. Analysis and numerical results indicate that the speedup of the parallel algorithms relative to the sequential algorithm is significant. Compared with the PRAM algorithm, the LARPBS algorithm is practical, optimally scalable and cost optimal.
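    The O(N^2) kernel that these parallel algorithms distribute is the pair-counting loop of the Grassberger-Procaccia correlation integral: for a radius r, count the fraction of point pairs closer than r. A serial illustration (the parallel versions split the pair loop across p processors):

```python
# Correlation integral C(r): the fraction of distinct point pairs whose
# distance is below r. The correlation dimension is the slope of
# log C(r) vs. log r over a scaling range.

def correlation_integral(points, r):
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):          # the O(N^2) pair loop
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            close += d < r
    return 2.0 * close / (n * (n - 1))

# For points uniform on a line segment, C(r) grows roughly linearly in r,
# consistent with a correlation dimension near 1.
pts = [(i / 100.0,) for i in range(100)]
c_big, c_small = correlation_integral(pts, 0.5), correlation_integral(pts, 0.25)
print(c_big > c_small > 0.0)   # prints True
```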

  16. Massive Parallel Quantum Computer Simulator

    CERN Document Server

    De Raedt, K; De Raedt, H; Ito, N; Lippert, T; Michielsen, K; Richter, M; Trieu, B; Watanabe, H; Lippert, Th.

    2006-01-01

    We describe portable software to simulate universal quantum computers on massive parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.
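    The memory figures above follow from the state-vector representation: an n-qubit state is 2^n complex amplitudes, doubling with every added qubit (36 qubits at 16 bytes per amplitude is about 1 TB). The basic update such simulators distribute over processors is a single-qubit gate applied across the whole vector, sketched here for a Hadamard; the function names are illustrative.

```python
# Single-qubit gate on a state vector: a Hadamard on qubit 'target' pairs
# amplitudes whose indices differ only in bit 'target' (little-endian).

import math

def hadamard(state, target, n):
    s = 1.0 / math.sqrt(2.0)
    out = state[:]
    step = 1 << target
    for i in range(1 << n):
        if i & step == 0:                  # visit each amplitude pair once
            a, b = state[i], state[i | step]
            out[i] = s * (a + b)
            out[i | step] = s * (a - b)
    return out

n = 3
state = [0.0] * (1 << n)
state[0] = 1.0                             # |000>
state = hadamard(state, 0, n)              # equal superposition on qubit 0
print(round(state[0] ** 2 + state[1] ** 2, 6))   # prints 1.0
```

    On a parallel machine the state vector is split across ranks; gates on high-order qubits pair amplitudes held by different processors, which is where the communication cost of such simulators comes from.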

  17. Graphene-supported metal oxide monolith

    Energy Technology Data Exchange (ETDEWEB)

    Worsley, Marcus A.; Baumann, Theodore F.; Biener, Juergen; Biener, Monika A.; Wang, Yinmin; Ye, Jianchao; Tylski, Elijah

    2017-01-10

    A composition comprising at least one graphene-supported metal oxide monolith, said monolith comprising a three-dimensional structure of graphene sheets crosslinked by covalent carbon bonds, wherein the graphene sheets are coated by at least one metal oxide such as iron oxide or titanium oxide. Also provided is an electrode comprising the aforementioned graphene-supported metal oxide monolith, wherein the electrode can be substantially free of any carbon-black and substantially free of any binder.

  18. Graphene-supported metal oxide monolith

    Science.gov (United States)

    Worsley, Marcus A.; Baumann, Theodore F.; Biener, Juergen; Biener, Monika A.; Wang, Yinmin; Ye, Jianchao; Tylski, Elijah

    2017-01-10

    A composition comprising at least one graphene-supported metal oxide monolith, said monolith comprising a three-dimensional structure of graphene sheets crosslinked by covalent carbon bonds, wherein the graphene sheets are coated by at least one metal oxide such as iron oxide or titanium oxide. Also provided is an electrode comprising the aforementioned graphene-supported metal oxide monolith, wherein the electrode can be substantially free of any carbon-black and substantially free of any binder.

  19. A Tutorial on Parallel and Concurrent Programming in Haskell

    Science.gov (United States)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs, with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs, allowing programmers to use rich data types in data-parallel programs that are automatically transformed into flat data-parallel versions for efficient execution on multi-core processors.

  20. Design and implementation of a high performance network security processor

    Science.gov (United States)

    Wang, Haixin; Bai, Guoqiang; Chen, Hongyi

    2010-03-01

    The last few years have seen many significant progresses in the field of application-specific processors. One example is network security processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation intensive burdens from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogenous parallel crypto engine arrays lead to a Gbps rate NSP, which is programmable with domain specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000 based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.

  1. Parallel matrix transpose algorithms on distributed memory concurrent computers

    Energy Technology Data Exchange (ETDEWEB)

    Choi, J.; Walker, D.W. [Oak Ridge National Lab., TN (United States); Dongarra, J.J. [Oak Ridge National Lab., TN (United States)]|[Univ. of Tennessee, Knoxville, TN (United States). Dept. of Computer Science

    1993-10-01

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.
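    The LCM/GCD step count quoted above is easy to tabulate. A small helper, assuming only the formula stated in the abstract:

```python
# Communication steps for the PUMMA block-scattered transpose on a P x Q
# processor template: steps = LCM(P, Q) / GCD(P, Q).

from math import gcd

def transpose_steps(P, Q):
    g = gcd(P, Q)
    lcm = P * Q // g
    return lcm // g

print(transpose_steps(4, 6))   # GCD=2, LCM=12 -> prints 6
print(transpose_steps(3, 5))   # relatively prime -> complete exchange, 15 steps
```

    The two printed cases show the regimes the abstract distinguishes: relatively prime P and Q force a complete exchange (LCM = P·Q steps with GCD = 1), while a large common divisor groups processors and shortens the schedule; P = Q collapses to a single step.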

  2. Specification for a reconfigurable optoelectronic VLSI processor suitable for digital signal processing.

    Science.gov (United States)

    Fey, D; Kasche, B; Burkert, C; Tschäche, O

    1998-01-10

    A concept for a parallel digital signal processor based on optical interconnections and optoelectronic VLSI circuits is presented. It is shown that the proper combination of optical communication, architecture, and algorithms allows a throughput that outperforms purely electronic solutions. The usefulness of low-level algorithms from the add-and-shift class is emphasized. These algorithms lead to fine-grain, massively parallel on-chip processor architectures with high demands for optical off-chip interconnections. A comparative performance analysis shows the superiority of a bit-serial architecture. This architecture is mapped onto an optoelectronic three-dimensional circuit, and the necessary optical interconnection scheme is specified.

  3. Geometric Design Rule Check of VLSI Layouts in Mesh Connected Processors

    Directory of Open Access Journals (Sweden)

    S. K. Nandy

    1994-01-01

    Design Rule Checking is a compute-intensive VLSI CAD task. In this paper we propose a parallel algorithm to perform Design Rule Check (DRC) of the geometries in a VLSI layout. The algorithm assumes the parallel architecture to be a two-dimensional mesh of processors and is based on a linear quadtree representation of the layout. Through a complexity analysis it is shown that it is possible to achieve a linear speedup in DRC with respect to the number of processors.

  4. A programmable power processor for a 25 kW power module. [on Shuttle Orbiter

    Science.gov (United States)

    Kapustka, R. E.; Lanier, J. R., Jr.

    1978-01-01

    The paper presents the concept and major design problems of a programmable power processor for the 25 kW electrical power system for the Shuttle Orbiter. The load will be handled by three parallel power stages operated in phase sequence with each power transistor having its own commutating diode and filter inductor. The power stages will be run at a fixed frequency of 10 kHz with the 'on'-time variable up to 100%. The input filter bank in the breadboard programmable power processor is planned to be a series-parallel combination of tantalum cased tantalum wet-slug capacitors.

  5. Designing High-Performance Fuzzy Controllers Combining IP Cores and Soft Processors

    Directory of Open Access Journals (Sweden)

    Oscar Montiel-Ross

    2012-01-01

    This paper presents a methodology to integrate a fuzzy coprocessor described in VHDL (VHSIC Hardware Description Language) with a soft processor embedded in an FPGA, which increases the throughput of the whole system: the controller uses parallelism at the circuitry level for high-speed-demanding applications, while the rest of the application can be written in C/C++. We used the ARM 32-bit soft processor, which allows sequential and parallel programming. The FLC coprocessor incorporates a tuning method that allows manipulation of the system response. We show experimental results using a fuzzy PD+I controller as the embedded coprocessor.

  6. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) designed to execute pattern matching with a high degree of parallelism. The AM system finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 828 2 Gbit/s serial links for a total in/out bandwidth of 56 Gb/s. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. ...

  7. The Serial Link Processor for the Fast TracKer (FTK) processor at ATLAS

    CERN Document Server

    Biesuz, Nicolo Vladi; The ATLAS collaboration; Luciano, Pierluigi; Magalotti, Daniel; Rossi, Enrico

    2015-01-01

    The Associative Memory (AM) system of the Fast Tracker (FTK) processor has been designed to perform pattern matching using the hit information of the ATLAS experiment silicon tracker. The AM is the heart of FTK and is mainly based on the use of ASICs (AM chips) purpose-designed to execute pattern matching with a high degree of parallelism. It finds track candidates at low resolution that are seeds for a full resolution track fitting. To solve the very challenging data traffic problems inside FTK, multiple board and chip designs have been performed. The currently proposed solution is named the “Serial Link Processor” and is based on an extremely powerful network of 2 Gb/s serial links. This paper reports on the design of the Serial Link Processor consisting of two types of boards, the Local Associative Memory Board (LAMB), a mezzanine where the AM chips are mounted, and the Associative Memory Board (AMB), a 9U VME board which holds and exercises four LAMBs. We report on the performance of the intermedia...

  8. Communications systems and methods for subsea processors

    Science.gov (United States)

    Gutierrez, Jose; Pereira, Luis

    2016-04-26

    A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
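    The TDMA discipline on the shared bus can be sketched in a few lines: each node owns a recurring slot in a fixed frame and may transmit only inside it, so the subsea processor and the surface link never collide. The slot layout and node names below are illustrative, not the patent's protocol.

```python
# Round-robin TDMA frame: time is divided into equal slots, and the owner of
# the current slot is determined purely from the clock, with no arbitration.

def slot_owner(t, schedule, slot_len):
    """Return which node may transmit at time t."""
    frame = slot_len * len(schedule)
    return schedule[int(t % frame // slot_len)]

schedule = ["subsea_processor", "BOP_sensors", "surface_link"]
print(slot_owner(0.5, schedule, slot_len=1.0))   # prints subsea_processor
print(slot_owner(4.2, schedule, slot_len=1.0))   # 4.2 mod 3.0 = 1.2 -> BOP_sensors
```

    The appeal of TDMA for this setting is determinism: as long as clocks stay synchronized, every component gets guaranteed, collision-free bus time without a master arbiter.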

  9. Efficient Multidimensional Data Redistribution for Resizable Parallel Computations

    CERN Document Server

    Sudarsan, Rajesh

    2007-01-01

    Traditional parallel schedulers running on cluster supercomputers support only static scheduling, where the number of processors allocated to an application remains fixed throughout the execution of the job. This results in under-utilization of idle system resources, thereby decreasing overall system throughput. In our research, we have developed a prototype framework called ReSHAPE, which supports dynamic resizing of parallel MPI applications executing on distributed memory platforms. The resizing library in ReSHAPE includes support for releasing and acquiring processors and efficiently redistributing application state to a new set of processors. In this paper, we derive an algorithm for redistributing two-dimensional block-cyclic arrays from P to Q processors, organized as 2-D processor grids. The algorithm ensures a contention-free communication schedule for data redistribution if P_r ≤ Q_r and P_c ≤ Q_c. In other cases, the algorithm implements circular row and column shifts on the communicat...

  10. The UA1 trigger processor

    CERN Document Server

    Grayer, G H

    1981-01-01

    Experiment UA1 is a large multipurpose spectrometer at the CERN proton-antiproton collider. The principal trigger is formed on the basis of the energy deposition in calorimeters. A trigger decision taken in under 2.4 microseconds can avoid dead-time losses due to the bunched nature of the beam. To achieve this, fast 8-bit charge-to-digital converters have been built, followed by two identical digital processors tailored to the experiment. The outputs of groups of the 2440 photomultipliers in the calorimeters are summed to form a total of 288 input channels to the ADCs. A look-up table in RAM is used to convert the digitised photomultiplier signals to energy in one processor, and to transverse energy in the other. Each processor forms four sums from a chosen combination of input channels, and also counts the number of clusters with electromagnetic or hadronic energy above pre-determined levels. Up to twelve combinations of these conditions, together with external information, may be combined in coincidence or in...
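    The RAM look-up-table trick described above replaces trigger-time arithmetic with a single table index: each 8-bit ADC code maps directly to a precomputed (transverse) energy. A sketch with invented calibration values, not UA1's actual tables:

```python
# 256-entry RAM LUT: ADC code -> transverse energy Et = E * sin(theta),
# precomputed per calorimeter cell so the trigger path does no multiplication.

import math

def build_lut(gain, theta):
    """gain converts ADC counts to energy; theta is the cell's polar angle."""
    return [code * gain * math.sin(theta) for code in range(256)]

lut = build_lut(gain=0.25, theta=math.pi / 2)   # cell at 90 degrees: Et = E
print(lut[100])   # prints 25.0
```

    Because the table is indexed in one memory access, the per-channel conversion cost is constant regardless of how complicated the calibration function is, which is what makes the sub-2.4-microsecond decision budget feasible.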

  11. Taxonomy of Data Prefetching for Multicore Processors

    Institute of Scientific and Technical Information of China (English)

    Surendra Byna; Yong Chen; Xian-He Sun

    2009-01-01

    Data prefetching is an effective data access latency hiding technique to mask the CPU stall caused by cache misses and to bridge the performance gap between processor and memory. With hardware and/or software support, data prefetching brings data closer to a processor before it is actually needed. Many prefetching techniques have been developed for single-core processors. Recent developments in processor technology have brought multicore processors into mainstream. While some of the single-core prefetching techniques are directly applicable to multicore processors, numerous novel strategies have been proposed in the past few years to take advantage of multiple cores. This paper aims to provide a comprehensive review of the state-of-the-art prefetching techniques, and proposes a taxonomy that classifies various design concerns in developing a prefetching strategy, especially for multicore processors. We compare various existing methods through analysis as well.
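    One of the simplest hardware schemes such a taxonomy classifies is the stride prefetcher: watch the address stream, and once a constant stride repeats, predict the next address. A minimal single-stream model, illustrative rather than any particular processor's design:

```python
# Stride prefetcher: after seeing the same stride twice in a row, it issues
# a prediction for (last_address + stride) on every subsequent access.

class StridePrefetcher:
    def __init__(self):
        self.last = None     # previous address seen
        self.stride = None   # previous stride observed

    def access(self, addr):
        """Observe an address; return the predicted next address or None."""
        pred = None
        if self.last is not None:
            stride = addr - self.last
            if stride == self.stride and stride != 0:
                pred = addr + stride          # confirmed stride: prefetch
            self.stride = stride
        self.last = addr
        return pred

pf = StridePrefetcher()
print([pf.access(a) for a in (100, 164, 228, 292)])
# prints [None, None, 292, 356]: predictions start once the stride repeats
```

    Real designs keep a table of such entries indexed by the load instruction's PC so that many interleaved streams can be tracked at once, which is exactly the kind of design concern the taxonomy enumerates.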

  12. Behavioral Simulation and Performance Evaluation of Multi-Processor Architectures

    Directory of Open Access Journals (Sweden)

    Ausif Mahmood

    1996-01-01

    The development of multi-processor architectures requires extensive behavioral simulations to verify the correctness of design and to evaluate its performance. A high level language can provide maximum flexibility in this respect if the constructs for handling concurrent processes and a time mapping mechanism are added. This paper describes a novel technique for emulating hardware processes involved in a parallel architecture such that an object-oriented description of the design is maintained. The communication and synchronization between hardware processes is handled by splitting the processes into their equivalent subprograms at the entry points. The proper scheduling of these subprograms is coordinated by a timing wheel which provides a time mapping mechanism. Finally, a high level language pre-processor is proposed so that the timing wheel and the process emulation details can be made transparent to the user.
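    The timing-wheel mechanism described above can be sketched directly: suspended process fragments are parked in the slot for their wake-up time, and the wheel advances one slot per simulated cycle, running whatever is due. The class and event names are illustrative assumptions.

```python
# Timing wheel: an array of slots indexed by (now + delay) mod size, giving
# O(1) scheduling and O(1) slot lookup per simulated cycle.

class TimingWheel:
    def __init__(self, slots):
        self.slots = [[] for _ in range(slots)]
        self.now = 0

    def schedule(self, delay, subprogram):
        """Park a process continuation to run 'delay' ticks from now."""
        self.slots[(self.now + delay) % len(self.slots)].append(subprogram)

    def tick(self):
        """Advance one cycle; run everything whose time has come."""
        self.now = (self.now + 1) % len(self.slots)
        ready, self.slots[self.now] = self.slots[self.now], []
        for run in ready:
            run()

log = []
wheel = TimingWheel(8)
wheel.schedule(2, lambda: log.append("B_responds"))
wheel.schedule(1, lambda: log.append("A_requests"))
for _ in range(3):
    wheel.tick()
print(log)   # prints ['A_requests', 'B_responds']
```

    The modulo indexing is why the wheel must be at least as large as the longest delay in flight; larger delays are typically handled with per-slot overflow lists carrying a remaining-turns counter.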

  13. NMRFx Processor: a cross-platform NMR data processing program.

    Science.gov (United States)

    Norris, Michael; Fetler, Bayard; Marchant, Jan; Johnson, Bruce A

    2016-08-01

    NMRFx Processor is a new program for the processing of NMR data. Written in the Java programming language, NMRFx Processor is a cross-platform application and runs on Linux, Mac OS X and Windows operating systems. The application can be run in both a graphical user interface (GUI) mode and from the command line. Processing scripts are written in the Python programming language and executed so that the low-level Java commands are automatically run in parallel on computers with multiple cores or CPUs. Processing scripts can be generated automatically from the parameters of NMR experiments or interactively constructed in the GUI. A wide variety of processing operations are provided, including methods for processing of non-uniformly sampled datasets using iterative soft thresholding. The interactive GUI also enables the use of the program as an educational tool for teaching basic and advanced techniques in NMR data analysis.
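
The non-uniform-sampling reconstruction mentioned above is built on iterative soft thresholding, whose core is the shrinkage operator: magnitudes are reduced by a threshold while sign (or phase, for complex NMR data) is preserved. A generic sketch of that operator, not NMRFx's actual code:

```python
def soft_threshold(x, lam):
    """Soft-thresholding (shrinkage) operator used in IST:
    shrink the magnitude of x by lam, preserving sign or phase;
    values at or below the threshold are zeroed."""
    mag = abs(x)
    if mag <= lam:
        return 0.0 * x          # keep the type (real or complex)
    return (x / mag) * (mag - lam)
```

In an IST loop this operator is applied in the frequency domain on each iteration, driving small (noise-like) components to zero while retaining strong peaks.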

  14. The fast tracker processor for hadron collider triggers

    CERN Document Server

    Annovi, A; Bardi, A; Carosi, R; Dell'Orso, Mauro; D'Onofrio, M; Giannetti, P; Iannaccone, G; Morsani, E; Pietri, M; Varotto, G

    2001-01-01

    Perspectives for precise and fast track reconstruction in future hadron collider experiments are addressed. We discuss the feasibility of a pipelined highly parallel processor dedicated to the implementation of a very fast tracking algorithm. The algorithm is based on the use of a large bank of pre-stored combinations of trajectory points, called patterns, for extremely complex tracking systems. The CMS experiment at LHC is used as a benchmark. Tracking data from the events selected by the level-1 trigger are sorted and filtered by the Fast Tracker processor at an input rate of 100 kHz. This data organization allows the level-2 trigger logic to reconstruct full resolution tracks with transverse momentum above a few GeV and search for secondary vertices within typical level-2 times. (15 refs).

  15. A Parallel Algorithm for Contact in a Finite Element Hydrocode

    Energy Technology Data Exchange (ETDEWEB)

    Pierce, T G

    2003-06-01

    A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies changes as collision and sliding occurs. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the same processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests are provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. 
This algorithm

  16. Real-time optical processor prototype for remote SAR applications

    Science.gov (United States)

    Marchese, Linda; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Bourqui, Pascal; Legros, Mathieu; Desnoyers, Nichola; Guillot, Ludovic; Mercier, Luc; Savard, Maxime; Martel, Anne; Châteauneuf, François; Bergeron, Alain

    2009-09-01

    A Compact Real-Time Optical SAR Processor has been successfully developed and tested. SAR, or Synthetic Aperture Radar, is a powerful tool providing enhanced day and night imaging capabilities. SAR systems typically generate large amounts of information, generally in the form of complex data that are difficult to compress. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates, this is a clear disadvantage. SAR images are typically processed electronically by applying dedicated Fourier transformations. This, however, can also be performed optically in real time; indeed, the first SAR images were processed optically. The optical processor architecture provides inherent parallel computing capabilities that can be used advantageously for SAR data processing. Onboard SAR image generation would provide local access to processed information, paving the way for real-time decision-making. This could eventually benefit navigation strategy and instrument orientation decisions. Moreover, for interplanetary missions, onboard analysis of images could provide important feature identification clues and could help select the appropriate images to be transmitted to Earth, consequently helping bandwidth management. This could ultimately reduce the data throughput requirements and related transmission bandwidth. This paper reviews the design of a compact optical SAR processor prototype intended to reduce power, weight, and size requirements, and presents an analysis of SAR image generation using the table-top optical processor. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed. Results of image generation from simulated point targets as well as real satellite-acquired raw data are presented.

  17. Lock-Free Parallel Access Collections

    Directory of Open Access Journals (Sweden)

    Bruce P. Lester

    2014-06-01

    Full Text Available All new computers have multicore processors. To exploit this hardware parallelism for improved performance, the predominant approach today is multithreading using shared variables and locks. This approach has potential data races that can create a nondeterministic program. This paper presents a promising new approach to parallel programming that is both lock-free and deterministic. The standard forall primitive for parallel execution of for-loop iterations is extended into a more highly structured primitive called a Parallel Operation (POP). Each parallel process created by a POP may read shared variables (or shared collections) freely. Shared collections modified by a POP must be selected from a special set of predefined Parallel Access Collections (PACs). Each PAC has several Write Modes that govern parallel updates in a deterministic way. This paper presents an overview of a Prototype Library that implements this POP-PAC approach for the C++ language, including performance results for two benchmark parallel programs.
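
The deterministic Write Modes can be approximated by giving each parallel task a private buffer and combining the buffers in a fixed order afterwards. The following is a hypothetical Python analogue of an additive write mode (the paper's actual prototype is a C++ library, and `parallel_op` is an invented name):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_op(n_tasks, body, combine, init):
    """Toy analogue of a POP writing to a PAC with an additive Write
    Mode: every task writes only to its own private buffer, and the
    buffers are combined in fixed task-index order, so the result is
    deterministic and lock-free regardless of thread scheduling."""
    with ThreadPoolExecutor() as pool:
        buffers = list(pool.map(body, range(n_tasks)))  # results in task order
    result = init
    for buf in buffers:              # deterministic combine order
        result = combine(result, buf)
    return result

# usage: sum of squares of 0..99 split across 4 parallel tasks
def chunk_sum(i, n_tasks=4, n=100):
    return sum(k * k for k in range(n) if k % n_tasks == i)

total = parallel_op(4, chunk_sum, lambda a, b: a + b, 0)
```

Because no task ever writes to shared state and the combine order is fixed, the result is identical on every run, which is the property the POP-PAC design aims for.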

  18. Method and apparatus for interconnecting processors in a hyper-dimensional array

    Energy Technology Data Exchange (ETDEWEB)

    Thiel, T.; Clayton, R.; Feyman, C.; Kahle, B.

    1989-02-14

    This patent describes a parallel processor having an array of integrated circuits, each being addressed by a multibit binary address and comprising at least one processor and being interconnected to each of its nearest neighbors in a pattern of a Boolean cube of more than three dimensions, apparatus for interconnecting the integrated circuits on boards and backplanes comprising: means for interconnecting on each circuit board the integrated circuits which are nearest neighbors in dimensions one through L; means for interconnecting on each backplane the integrated circuits which are nearest neighbors in dimensions L+1 through M; means for interconnecting from one backplane to another the integrated circuits which are nearest neighbors in dimensions M+1 through N; and means for adjusting the clock cycle during the routing of messages between processors in accordance with the dimension of the processors between which the message is being routed.
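
The addressing scheme in the claim rests on the Boolean-cube property that nearest neighbors differ in exactly one address bit, and that bit's position determines whether the link is wired on-board, on the backplane, or between backplanes. A sketch (function and parameter names are hypothetical, with L and M as in the claim):

```python
def neighbors(addr, n_dims):
    """Nearest neighbors of a node in an n-dimensional Boolean cube:
    the nodes whose binary addresses differ from addr in exactly one bit."""
    return [addr ^ (1 << d) for d in range(n_dims)]

def wiring_level(a, b, on_board, on_backplane):
    """Classify the link between neighbor addresses a and b, assuming
    dimensions 1..L (on_board) are wired on the circuit board,
    L+1..M (on_backplane) on the backplane, and M+1..N between
    backplanes, as in the patent's partition of the cube."""
    d = (a ^ b).bit_length()        # 1-based dimension of the differing bit
    if d <= on_board:
        return "board"
    if d <= on_backplane:
        return "backplane"
    return "inter-backplane"
```

Because inter-backplane links are physically longer, the claim's clock-cycle adjustment amounts to slowing the routing clock whenever the message's current dimension classifies as "inter-backplane".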

  19. Evaluation and Choice of Various Branch Predictors for Low-Power Embedded Processor

    Institute of Scientific and Technical Information of China (English)

    FAN DongRui (范东睿); YANG HongBo (杨洪波); GAO GuangRong (高光荣); ZHAO RongCai (赵荣彩)

    2003-01-01

    Power is an important design constraint in embedded computing systems. To meet the power constraint, microarchitecture and hardware designed to achieve high performance need to be revisited, from both performance and power angles. This paper studies one of them: the branch predictor. As is well known, branch prediction is critical to exploiting instruction-level parallelism effectively, but it may incur additional power consumption due to the hardware resources dedicated to branch prediction and the extra power consumed on mispredicted branches. This paper explores the design space of branch prediction mechanisms and tries to find the most beneficial one for realizing a low-power embedded processor. The sample processor studied is a Godson-like processor: a dual-issue, out-of-order processor with a deep pipeline, supporting the MIPS instruction set.
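
A common baseline point in such a design-space exploration is the 2-bit saturating-counter predictor. The sketch below is an illustrative software model of that textbook scheme, not the Godson design:

```python
class TwoBitPredictor:
    """Classic 2-bit saturating-counter branch predictor: one counter
    per table entry, indexed by low PC bits.  Counter states 0-1
    predict not-taken, 2-3 predict taken; the two-state hysteresis
    tolerates a single anomalous outcome (e.g. a loop exit)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.table = [1] * (1 << index_bits)   # start weakly not-taken

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2  # True = predict taken

    def update(self, pc, taken):
        i = pc & self.mask
        c = self.table[i]
        self.table[i] = min(c + 1, 3) if taken else max(c - 1, 0)
```

Table size (index_bits) is exactly the kind of knob such a study sweeps: a larger table reduces aliasing and mispredictions but costs more area and leakage power.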

  20. Monolithically integrated absolute frequency comb laser system

    Energy Technology Data Exchange (ETDEWEB)

    Wanke, Michael C.

    2016-07-12

    Rather than down-convert optical frequencies, a QCL laser system directly generates a THz frequency comb in a compact monolithically integrated chip that can be locked to an absolute frequency without the need of a frequency-comb synthesizer. The monolithic, absolute frequency comb can provide a THz frequency reference and a tool for high-resolution broadband spectroscopy.

  1. Nanosecond monolithic CMOS readout cell

    Science.gov (United States)

    Souchkov, Vitali V.

    2004-08-24

    A pulse shaper is implemented in monolithic CMOS with a delay unit formed of a unity gain buffer. The shaper is formed of a difference amplifier having one input connected directly to an input signal and a second input connected to a delayed input signal through the buffer. An elementary cell is based on the pulse shaper and a timing circuit which gates the output of an integrator connected to the pulse shaper output. A detector readout system is formed of a plurality of elementary cells, each connected to a pixel of a pixel array, or to a microstrip of a plurality of microstrips, or to a detector segment.

  2. Compact monolithic capacitive discharge unit

    Science.gov (United States)

    Roesler, Alexander W.; Vernon, George E.; Hoke, Darren A.; De Marquis, Virginia K.; Harris, Steven M.

    2007-06-26

    A compact monolithic capacitive discharge unit (CDU) is disclosed in which a thyristor switch and a flyback charging circuit are both sandwiched about a ceramic energy storage capacitor. The result is a compact rugged assembly which provides a low-inductance current discharge path. The flyback charging circuit preferably includes a low-temperature co-fired ceramic transformer. The CDU can further include one or more ceramic substrates for enclosing the thyristor switch and for holding various passive components used in the flyback charging circuit. A load such as a detonator can also be attached directly to the CDU.

  3. Development and Investigation of Evacuated Windows Based on Monolithic Silica Xerogel Spacers

    DEFF Research Database (Denmark)

    Schultz, Jørgen Munthe; Svendsen, Sv Aa Højgaard

    The objective of the project is to develop and investigate insulating glazings based on evacuated monolithic silica xerogel spacers. Since the starting date January 1, 1994 the project has been closely connected to the parallel project "Development and Investigation of Evacuated Windows based...... will be approximately 0.013 W/(m K) which is approximately 33% of the value for commonly used insulation materials, e.g. mineral wool. Monolithic silica xerogel is a highly porous material (pore volume up to 90%) with a solar transmittance of 50% (thickness = 20 mm). However, if the silica xerogel is not made...... and 3) application for insulating glazings.Scientific developments have made it possible to prepare low density monolithic silica xerogels, only from about 1990, and developments in both the production process as well as size of the samples are necessary for a commercial use of the material...

  4. Massively Parallel Finite Element Programming

    KAUST Repository

    Heister, Timo

    2010-01-01

    Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.

  5. Biasable, Balanced, Fundamental Submillimeter Monolithic Membrane Mixer

    Science.gov (United States)

    Siegel, Peter; Schlecht, Erich; Mehdi, Imran; Gill, John; Velebir, James; Tsang, Raymond; Dengler, Robert; Lin, Robert

    2010-01-01

    This device is a biasable, submillimeter-wave, balanced mixer fabricated using JPL's monolithic membrane process, a simplified version of planar membrane technology. The primary target application is instrumentation used for analysis of atmospheric constituents, pressure, temperature, winds, and other physical and chemical properties of the atmospheres of planets and comets. Other applications include high-sensitivity gas detection and analysis. This innovation uses a balanced configuration of two diodes allowing the radio frequency (RF) signal and local oscillator (LO) inputs to be separated. This removes the need for external diplexers that are inherently narrowband, bulky, and require mechanical tuning to change frequency. Additionally, this mixer uses DC bias-ability to improve its performance and versatility. In order to solve problems relating to circuit size, the GaAs membrane process was created. As much of the circuitry as possible is fabricated on-chip, making the circuit monolithic. The remainder of the circuitry is precision-machined into a waveguide block that holds the GaAs circuit. The most critical alignments are performed using micron-scale semiconductor technology, enabling wide bandwidth and high operating frequencies. The balanced mixer achieves superior performance with less than 2 mW of LO power. This can be provided by a simple two-stage multiplier chain following an amplifier at around 90 GHz. Further, the diodes are arranged so that they can be biased. Biasing pushes the diodes closer to their switching voltage, so that less LO power is required to switch the diodes on and off. In the photo, the diodes are at the right end of the circuit. The LO comes from the waveguide at the right into a reduced-height section containing the diodes. Because the diodes are in series to the LO signal, they are both turned on and off simultaneously once per LO cycle. 
Conversely, the RF signal is picked up from the RF waveguide by the probe at the left, and flows

  6. Parallel Jacobi EVD Methods on Integrated Circuits

    Directory of Open Access Journals (Sweden)

    Chi-Chia Sun

    2014-01-01

    Full Text Available Design strategies for parallel iterative algorithms are presented. In order to further study different tradeoff strategies in design criteria for integrated circuits, a 10 × 10 Jacobi Brent-Luk-EVD array with the simplified μ-CORDIC processor is used as an example. The experimental results show that using the μ-CORDIC processor is beneficial for the design criteria, as it yields a smaller area, faster overall computation time, and less energy consumption than the regular CORDIC processor. It is worth noting that the proposed parallel EVD method can be applied to real-time and low-power array signal processing algorithms performing beamforming or DOA estimation.
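
The elementary operation performed by each node of a Jacobi EVD array is a plane rotation that zeroes one off-diagonal element of the symmetric input matrix. In floating point (rather than the CORDIC arithmetic the paper evaluates) that scalar step can be sketched as:

```python
import math

def jacobi_rotate(a, p, q):
    """Apply one Jacobi rotation, in place, to symmetric matrix a
    (list of lists), choosing the angle that zeroes a[p][q].  This is
    the step each processor of a Brent-Luk array performs; CORDIC
    hardware approximates the same rotation with shift-add iterations."""
    if a[p][q] == 0.0:
        return
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    n = len(a)
    for k in range(n):                       # rotate columns p and q
        akp, akq = a[k][p], a[k][q]
        a[k][p] = c * akp - s * akq
        a[k][q] = s * akp + c * akq
    for k in range(n):                       # rotate rows p and q
        apk, aqk = a[p][k], a[q][k]
        a[p][k] = c * apk - s * aqk
        a[q][k] = s * apk + c * aqk
```

Sweeping this rotation over all index pairs (p, q) drives the matrix toward diagonal form, with the eigenvalues appearing on the diagonal.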

  7. Parallel Programming in the Age of Ubiquitous Parallelism

    Science.gov (United States)

    Pingali, Keshav

    2014-04-01

    Multicore and manycore processors are now ubiquitous, but parallel programming remains as difficult as it was 30-40 years ago. During this time, our community has explored many promising approaches including functional and dataflow languages, logic programming, and automatic parallelization using program analysis and restructuring, but none of these approaches has succeeded except in a few niche application areas. In this talk, I will argue that these problems arise largely from the computation-centric foundations and abstractions that we currently use to think about parallelism. In their place, I will propose a novel data-centric foundation for parallel programming called the operator formulation in which algorithms are described in terms of actions on data. The operator formulation shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation/refinement/partitioning and SAT solvers. Regular algorithms emerge as a special case of irregular ones, and many application-specific optimization techniques can be generalized to a broader context. The operator formulation also leads to a structural analysis of algorithms called TAO-analysis that provides implementation guidelines for exploiting parallelism efficiently. Finally, I will describe a system called Galois based on these ideas for exploiting amorphous data-parallelism on multicores and GPUs.

  8. .NET 4.5 parallel extensions

    CERN Document Server

    Freeman, Bryan

    2013-01-01

    This book contains practical recipes on everything you will need to create task-based parallel programs using C#, .NET 4.5, and Visual Studio. The book is packed with illustrated code examples to create scalable programs. This book is intended to help experienced C# developers write applications that leverage the power of modern multicore processors. It provides the necessary knowledge for an experienced C# developer to work with .NET parallelism APIs. Previous experience of writing multithreaded applications is not necessary.

  9. Biomolecular simulation on thousands of processors

    Science.gov (United States)

    Phillips, James Christopher

    Classical molecular dynamics simulation is a generally applicable method for the study of biomolecular aggregates of proteins, lipids, and nucleic acids. As experimental techniques have revealed the structures of larger and more complex biomolecular machines, the time required to complete even a single meaningful simulation of such systems has become prohibitive. We have developed the program NAMD to simulate systems of 50,000--500,000 atoms efficiently with full electrostatics on parallel computers with 1000 or more processors. NAMD's scalability is achieved through latency-tolerant adaptive message-driven execution and measurement-based load balancing. NAMD is implemented in C++ and uses object-oriented design and threads to shield the basic algorithms from the necessary complexity of high-performance parallel execution. Apolipoprotein A-I is the primary protein constituent of high density lipoprotein particles, which transport cholesterol in the bloodstream. In collaboration with A. Jonas, we have constructed and simulated models of the nascent discoidal form of these particles, providing theoretical insight to the debate regarding the lipid-bound structure of the protein. Recently, S. Sligar and coworkers have created 10 nm phospholipid bilayer nanoparticles comprising a small lipid bilayer disk solubilized by synthetic membrane scaffold proteins derived from apolipoprotein A-I. Membrane proteins may be embedded in the water-soluble disks, with various medical and technological applications. We are working to develop variant scaffold proteins that produce disks of greater size, stability, and homogeneity. Our simulations have demonstrated a significant deviation from idealized cylindrical structure, and are being used in the interpretation of small angle x-ray scattering data.

  10. Design of highly efficient elliptic curve crypto-processor with two multiplications over GF(2^163)

    Institute of Scientific and Technical Information of China (English)

    DAN Yong-ping; ZOU Xue-cheng; LIU Zheng-lin; HAN Yu; YI Li-hua

    2009-01-01

    In this article, a parallel hardware processor is presented to compute elliptic curve scalar multiplication in polynomial basis representation. The processor performs scalar multiplication using a modular arithmetic logic unit (MALU). The MALU consists of two multipliers, one adder, and one squaring unit. The two multiplications and the addition or squaring can be computed in parallel. The whole computation of scalar multiplication over GF(2^163) can be performed in 3 064 cycles. The simulation results based on Xilinx Virtex2 XC2V6000 FPGAs show that the proposed design can compute random GF(2^163) elliptic curve scalar multiplication operations in 31.17 μs, and the resource occupies 3 994 registers and 15 527 LUTs, which indicates that the crypto-processor is suitable for high-performance applications.
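
The polynomial-basis field multiplication that the MALU's multipliers implement can be sketched bit-serially. The sketch below assumes the NIST reduction polynomial for this field size, x^163 + x^7 + x^6 + x^3 + 1 (the paper does not state its polynomial, so this is an assumption), and the actual hardware is digit-parallel rather than one bit per step:

```python
# Assumed field: GF(2^163) with reduction polynomial x^163 + x^7 + x^6 + x^3 + 1
M = 163
POLY_LOW = (1 << 7) | (1 << 6) | (1 << 3) | 1   # low-order terms of the polynomial

def gf163_mult(a, b):
    """Bit-serial (MSB-first) polynomial-basis multiplication in
    GF(2^163): repeatedly multiply the accumulator by x, reduce when
    the degree overflows, and conditionally XOR in the multiplicand.
    Purely illustrative; the MALU performs several bits per cycle."""
    acc = 0
    for i in reversed(range(M)):
        acc <<= 1                           # acc = acc * x
        if (acc >> M) & 1:                  # degree-163 term appeared: reduce
            acc = (acc ^ (1 << M)) ^ POLY_LOW
        if (b >> i) & 1:
            acc ^= a
    return acc
```

Addition in GF(2^m) is just XOR, which is why the MALU can overlap an addition or squaring with the two multiplications at little extra cost.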

  11. Graph Partitioning Models for Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    Hendrickson, B.; Kolda, T.G.

    1999-03-02

    Calculations can naturally be described as graphs in which vertices represent computation and edges reflect data dependencies. By partitioning the vertices of a graph, the calculation can be divided among processors of a parallel computer. However, the standard methodology for graph partitioning minimizes the wrong metric and lacks expressibility. We survey several recently proposed alternatives and discuss their relative merits.
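
The "wrong metric" argument can be made concrete by comparing the edge cut with a communication-volume count on the same partition; a minimal sketch (data layout and names are invented for illustration):

```python
def edge_cut(edges, part):
    """Edge cut: number of edges whose endpoints fall in different
    parts.  This is the standard partitioning objective, even though
    it only approximates real communication cost."""
    return sum(1 for u, v in edges if part[u] != part[v])

def comm_volume(adj, part):
    """A closer proxy for communication: for each vertex, count the
    distinct remote parts among its neighbors, then sum over vertices
    (a vertex's value is sent once per remote part, not once per edge)."""
    return sum(len({part[v] for v in nbrs} - {part[u]})
               for u, nbrs in adj.items())

# demo: a 4-cycle 0-1-2-3-0 split into two parts
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
part = {0: 0, 1: 0, 2: 1, 3: 1}
```

On graphs where boundary vertices have many cut edges into the same part, the two metrics diverge sharply, which is one of the deficiencies the survey discusses.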

  12. Parallel Clustering Algorithms for Structured AMR

    Energy Technology Data Exchange (ETDEWEB)

    Gunney, B T; Wissink, A M; Hysom, D A

    2005-10-26

    We compare several different parallel implementation approaches for the clustering operations performed during adaptive gridding operations in patch-based structured adaptive mesh refinement (SAMR) applications. Specifically, we target the clustering algorithm of Berger and Rigoutsos (BR91), which is commonly used in many SAMR applications. The baseline for comparison is a simplistic parallel extension of the original algorithm that works well for up to O(10^2) processors. Our goal is a clustering algorithm for machines of up to O(10^5) processors, such as the 64K-processor IBM BlueGene/Light system. We first present an algorithm that avoids the unneeded communications of the simplistic approach to improve the clustering speed by up to an order of magnitude. We then present a new task-parallel implementation to further reduce communication wait time, adding another order of magnitude of improvement. The new algorithms also exhibit more favorable scaling behavior for our test problems. Performance is evaluated on a number of large scale parallel computer systems, including a 16K-processor BlueGene/Light system.
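
The heart of each Berger-Rigoutsos step is a signature-based search for a split index: tagged cells are counted along each row or column, and the box is cut at a hole in the signature or, failing that, at the strongest sign change of the signature's discrete Laplacian. A simplified sequential version of that kernel (the paper's contribution is parallelizing this search, and the tie-breaking here is a guess at one reasonable variant):

```python
def br_split(sig):
    """Pick a split index for one Berger-Rigoutsos step, given the 1-D
    signature sig (count of tagged cells per row or column).  Prefer an
    interior zero ("hole"); otherwise split at the strongest sign change
    of the discrete second difference; otherwise bisect."""
    n = len(sig)
    for i in range(1, n - 1):
        if sig[i] == 0:                   # hole found: split there
            return i
    d2 = [sig[i - 1] - 2 * sig[i] + sig[i + 1] for i in range(1, n - 1)]
    best, cut = -1, n // 2                # default: bisect
    for i in range(len(d2) - 1):
        jump = abs(d2[i + 1] - d2[i])
        if d2[i] * d2[i + 1] < 0 and jump > best:
            best, cut = jump, i + 2       # d2[i] corresponds to sig index i+1
    return cut
```

In the parallel versions the paper compares, it is the signature reduction (a sum across processors owning parts of the box) that dominates communication cost.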

  13. Resource Centered Computing delivering high parallel performance

    OpenAIRE

    2014-01-01

    International audience; Modern parallel programming requires a combination of different paradigms, expertise and tuning, that correspond to the different levels in today's hierarchical architectures. To cope with the inherent difficulty, ORWL (ordered read-write locks) presents a new paradigm and toolbox centered around local or remote resources, such as data, processors or accelerators. ORWL programmers describe their computation in terms of access to these resources during critical sections. Exclu...

  14. Functional Verification of Enhanced RISC Processor

    OpenAIRE

    SHANKER NILANGI; SOWMYA L

    2013-01-01

    This paper presents the design and verification of a 32-bit enhanced RISC processor core with floating-point computation integrated within the core, designed to reduce cost and complexity. The 3-stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture, with a single-precision floating-point multiplier, a floating-point adder/subtractor for floating-point operations, and a 32 × 32 Booth multiplier added to the integer core of the ARM7. The binary representati...

  15. Digital Signal Processor For GPS Receivers

    Science.gov (United States)

    Thomas, J. B.; Meehan, T. K.; Srinivasan, J. M.

    1989-01-01

    Three innovative components are combined to produce an all-digital signal processor with superior characteristics: outstanding accuracy, high-dynamics tracking, versatile integration times, lower loss-of-lock signal strengths, and infrequent cycle slips. The three components are a digital chip advancer, a digital carrier downconverter and code correlator, and a digital tracking processor. The all-digital signal processor is intended for use in receivers of the Global Positioning System (GPS) for geodesy, geodynamics, high-dynamics tracking, and ionospheric calibration.

  16. Design Principles for Synthesizable Processor Cores

    DEFF Research Database (Denmark)

    Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

    2012-01-01

    As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...... through the use of micro-benchmarks that our principles guide the design of a processor core that improves performance by an average of 38% over a similar Xilinx MicroBlaze configuration....

  17. Addressing Thermal and Performance Variability Issues in Dynamic Processors

    Energy Technology Data Exchange (ETDEWEB)

    Yoshii, Kazutomo [Argonne National Lab. (ANL), Argonne, IL (United States); Llopis, Pablo [Univ. Carlos III de Madrid (Spain); Zhang, Kaicheng [Northwestern Univ., Evanston, IL (United States); Luo, Yingyi [Northwestern Univ., Evanston, IL (United States); Ogrenci-Memik, Seda [Northwestern Univ., Evanston, IL (United States); Memik, Gokhan [Northwestern Univ., Evanston, IL (United States); Sankaran, Rajesh [Argonne National Lab. (ANL), Argonne, IL (United States); Beckman, Pete [Argonne National Lab. (ANL), Argonne, IL (United States)

    2017-03-01

    As CMOS scaling nears its end, parameter variations (process, temperature and voltage) are becoming a major concern. To overcome parameter variations and provide stability, modern processors are becoming dynamic, opportunistically adjusting voltage and frequency based on thermal and energy constraints, which negatively impacts traditional bulk-synchronous parallelism-minded hardware and software designs. As node-level architecture is growing in complexity, implementing variation control mechanisms only with hardware can be a challenging task. In this paper we investigate a software strategy to manage hardware-induced variations, leveraging low-level monitoring/controlling mechanisms.

  18. Negative base encoding in optical linear algebra processors

    Science.gov (United States)

    Perlee, C.; Casasent, D.

    1986-01-01

    In the digital multiplication by analog convolution algorithm, the bits of two encoded numbers are convolved to form the product of the two numbers in mixed binary representation; this output can be easily converted to binary. Attention is presently given to negative base encoding, treating base -2 initially, and then showing that the negative base system can be readily extended to any radix. In general, negative base encoding in optical linear algebra processors represents a more efficient technique than either sign magnitude or 2's complement encoding, when the additions of digitally encoded products are performed in parallel.
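
The base -2 number system itself is easy to state in software terms: every integer, positive or negative, has a unique digit string over {0, 1} with no sign symbol. A sketch of the encoding and decoding (of the number system only, not of the optical processor):

```python
def to_negabase(n, base=-2):
    """Encode integer n in a negative base (default -2).  Digits are
    forced into 0..|base|-1 by adjusting the quotient when Python's
    floor division yields a negative remainder."""
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, base)
        if r < 0:                 # force a non-negative digit
            n += 1
            r -= base
        digits.append(str(r))
    return "".join(reversed(digits))

def from_negabase(s, base=-2):
    """Decode a digit string back to an integer (Horner's rule)."""
    value = 0
    for d in s:
        value = value * base + int(d)
    return value
```

For example, 6 encodes as "11010" (16 - 8 - 2) and -3 as "1101" (-8 + 4 + 1), illustrating why negative base encoding needs neither a sign-magnitude nor a 2's complement convention.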

  19. The case for a generic implant processor.

    Science.gov (United States)

    Strydis, Christos; Gaydadjiev, Georgi N

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably selected processor simulator, various operational aspects of the application are being monitored. Findings on performance, cache behavior, branch prediction, power consumption, energy expenditure and instruction mixes are presented and analyzed. The suitability of such an implant processor and directions for future work are given.

  20. Microfluidic devices and methods including porous polymer monoliths

    Science.gov (United States)

    Hatch, Anson V; Sommer, Gregory J; Singh, Anup K; Wang, Ying-Chih; Abhyankar, Vinay V

    2014-04-22

    Microfluidic devices and methods including porous polymer monoliths are described. Polymerization techniques may be used to generate porous polymer monoliths having pores defined by a liquid component of a fluid mixture. The fluid mixture may contain iniferters and the resulting porous polymer monolith may include surfaces terminated with iniferter species. Capture molecules may then be grafted to the monolith pores.

  1. Alternative Water Processor Test Development

    Science.gov (United States)

    Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wheeler, Ray; Lunn, Griffin; Jackson, Andrew

    2012-01-01

    The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  2. Alternative Water Processor Test Development

    Science.gov (United States)

Pickering, Karen D.; Mitchell, Julie L.; Adam, Niklas M.; Barta, Daniel; Meyer, Caitlin E.; Pensinger, Stuart; Vega, Leticia M.; Callahan, Michael R.; Flynn, Michael; Wheeler, Ray

    2013-01-01

The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

  3. Vectorization, parallelization and porting of nuclear codes. Vectorization and parallelization. Progress report fiscal 1999

    Energy Technology Data Exchange (ETDEWEB)

    Adachi, Masaaki; Ogasawara, Shinobu; Kume, Etsuo [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Ishizuki, Shigeru; Nemoto, Toshiyuki; Kawasaki, Nobuo; Kawai, Wataru [Fujitsu Ltd., Tokyo (Japan); Yatake, Yo-ichi [Hitachi Ltd., Tokyo (Japan)

    2001-02-01

Several computer codes in the nuclear field have been vectorized, parallelized and ported on the FUJITSU VPP500 system, the AP3000 system, the SX-4 system and the Paragon system at the Center for Promotion of Computational Science and Engineering of the Japan Atomic Energy Research Institute. We dealt with 18 codes in fiscal 1999. These results are reported in three parts: the vectorization and parallelization part on vector processors, the parallelization part on scalar processors, and the porting part. In this report, we describe the vectorization and parallelization on vector processors of the Relativistic Molecular Orbital Calculation code RSCAT, the microscopic transport code for high-energy nuclear collisions JAM, the three-dimensional non-steady thermal-fluid analysis code STREAM, the Relativistic Density Functional Theory code RDFT and the High Speed Three-Dimensional Nodal Diffusion code MOSRA-Light on the VPP500 system and the SX-4 system. (author)

  4. Synchronizing Parallel Tasks Using STM

    Directory of Open Access Journals (Sweden)

    Ryan Saptarshi Ray

    2015-03-01

Full Text Available The past few years have marked the start of a historic transition from sequential to parallel computation. The necessity to write parallel programs is increasing as systems are getting more complex while processor speed increases are slowing down. Current parallel programming uses low-level constructs like threads and explicit synchronization using locks to coordinate thread execution. Parallel programs written with these constructs are difficult to design, program and debug. Locks also have many drawbacks which make them a suboptimal solution. One such drawback is that locks should only be used to enclose the critical section of the parallel-processing code; if locks are used to enclose the entire code, the performance of the code drastically decreases. Software Transactional Memory (STM) is a promising new approach to programming shared-memory parallel processors. It is a concurrency control mechanism that is widely considered to be easier to use by programmers than locking. It allows portions of a program to execute in isolation, without regard to other, concurrently executing tasks. A programmer can reason about the correctness of code within a transaction and need not worry about complex interactions with other, concurrently executing parts of the program. If STM is used to enclose the entire code, the performance is the same as when STM encloses only the critical section, and far better than that of code in which locks enclose the entire code. STM is therefore easier to use than locks, as the critical section does not need to be identified. This paper presents the concept of writing code using Software Transactional Memory (STM) and compares the performance of code using locks with that of code using STM. It also shows why the use of STM in parallel-processing code is better than the use of locks.
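As a rough illustration of the optimistic, retry-on-conflict execution style the record describes, here is a toy version-checking "STM" sketch in Python. The `TVar` and `atomically` names, and the single internal commit lock, are inventions of this sketch, not the paper's implementation; a real STM tracks read/write sets per transaction.

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # internal to the STM, never seen by user code

def atomically(transaction, *tvars):
    """Run `transaction` optimistically; retry if a concurrent commit
    changed any variable that was read (conflict detection by version)."""
    while True:
        # Snapshot versions and values of every variable the transaction reads.
        snapshot = [(v.version, v.value) for v in tvars]
        writes = transaction(*[val for _, val in snapshot])
        with _commit_lock:
            if all(v.version == ver for v, (ver, _) in zip(tvars, snapshot)):
                for v, new_val in zip(tvars, writes):
                    v.value = new_val
                    v.version += 1
                return
        # Conflict: some variable changed underneath us; rerun the transaction.

# Transfer between two accounts with no user-level locking at all.
a, b = TVar(100), TVar(50)

def transfer(a_val, b_val):
    return a_val - 30, b_val + 30

atomically(transfer, a, b)
print(a.value, b.value)  # 70 80
```

The user writes only the pure `transfer` function; isolation and retry are handled by the runtime, which is the usability advantage over explicit locks claimed in the paper.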

  5. Parallel optimization algorithms and their implementation in VLSI design

    Science.gov (United States)

    Lee, G.; Feeley, J. J.

    1991-01-01

    Two new parallel optimization algorithms based on the simplex method are described. They may be executed by a SIMD parallel processor architecture and be implemented in VLSI design. Several VLSI design implementations are introduced. An application example is reported to demonstrate that the algorithms are effective.

  6. Selecting Simulation Models when Predicting Parallel Program Behaviour

    OpenAIRE

    Broberg, Magnus; Lundberg, Lars; Grahn, Håkan

    2002-01-01

The use of multiprocessors is an important way to increase the performance of a supercomputing program. This means that the program has to be parallelized to make use of the multiple processors. The parallelization is unfortunately not an easy task. Development tools supporting parallel programs are important. Further, it is the customer that decides the number of processors in the target machine, and as a result the developer has to make sure that the program runs efficiently on any numbe...

  7. Parallel community climate model: Description and user's guide

    Energy Technology Data Exchange (ETDEWEB)

Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H. [and others]

    1996-07-15

This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user's guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
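The patch decomposition strategy can be sketched in a few lines. This toy Python routine (the function name and patch layout are assumptions for illustration, not taken from PCCM2) splits a latitude-longitude grid into geographic patches and assigns one patch per processor rank, so that physics calculations touch only rank-local points:

```python
def decompose(nlat, nlon, plat, plon):
    """Split an nlat x nlon grid into plat x plon geographic patches and
    return {rank: (lat_range, lon_range)}, one patch per processor rank."""
    patches = {}
    rank = 0
    for i in range(plat):
        for j in range(plon):
            lat0, lat1 = i * nlat // plat, (i + 1) * nlat // plat
            lon0, lon1 = j * nlon // plon, (j + 1) * nlon // plon
            patches[rank] = (range(lat0, lat1), range(lon0, lon1))
            rank += 1
    return patches

# A 64 x 128 grid split among 8 ranks as a 2 x 4 patch layout.
p = decompose(64, 128, 2, 4)
print(len(p))   # 8
print(p[0])     # (range(0, 32), range(0, 32))
```

Each rank would then run the column physics over its own ranges only; the spectral transforms described in the record require the additional parallel FFT/Legendre algorithms and inter-patch communication that this sketch omits.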

  8. Monolithic Continuous-Flow Bioreactors

    Science.gov (United States)

    Stephanopoulos, Gregory; Kornfield, Julia A.; Voecks, Gerald A.

    1993-01-01

Monolithic ceramic matrices containing many small flow passages useful as continuous-flow bioreactors. Ceramic matrix containing passages made by extruding and firing suitable ceramic. Pores in matrix provide attachment medium for film of cells and allow free movement of solution. Material not toxic to micro-organisms grown in reactor. In reactor, liquid nutrients flow over, and liquid reaction products flow from, cell culture immobilized in one set of channels while oxygen flows to, and gaseous reaction products flow from, culture in adjacent set of passages. Cells live on inner surfaces containing flowing nutrient and in pores of walls of passages. Ready access to nutrients and oxygen in channels. They generate continuous high yield characteristic of immobilized cells, without large expenditure of energy otherwise incurred if necessary to pump nutrient solution through dense biomass as in bioreactors of other types.

  9. Dynamic Distribution Model with Prime Granularity for Parallel Computing

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

Dynamic distribution model is one of the best schemes for parallel volume rendering. However, in a homogeneous cluster system where the granularity is traditionally identical, all processors communicate almost simultaneously and the computation load may lose balance. Due to the problems above, a dynamic distribution model with prime granularity for parallel computing is presented. The granularities of the processors are relatively prime, and related theories are introduced. High parallel performance can be achieved by minimizing network competition and using a load balancing strategy that ensures all processors finish almost simultaneously. Based on the Master-Slave-Gleaner (MSG) scheme, the parallel splatting algorithm for volume rendering is used to test the model on an IBM Cluster 1350 system. The experimental results show that the model brings a considerable improvement in performance, including computation efficiency, total execution time, speed, and load balancing.
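A minimal sketch of the core idea: choose one task granularity per processor so that all granularities are pairwise relatively prime, which staggers the times at which processors finish chunks and request more work, reducing simultaneous network contention. The greedy search below is an assumption of this sketch, not the paper's exact selection model:

```python
from math import gcd

def coprime_granularities(n_procs, base):
    """Choose one task granularity per processor, all pairwise coprime,
    searching upward from `base` so the loads stay roughly equal."""
    chosen = []
    g = base
    while len(chosen) < n_procs:
        if all(gcd(g, c) == 1 for c in chosen):
            chosen.append(g)
        g += 1
    return chosen

print(coprime_granularities(4, 8))  # [8, 9, 11, 13]
```

Because no two granularities share a factor, two processors that happen to synchronize on one request will drift apart on subsequent ones instead of colliding every cycle.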

  10. Anisotropically structured magnetic aerogel monoliths

    Science.gov (United States)

    Heiligtag, Florian J.; Airaghi Leccardi, Marta J. I.; Erdem, Derya; Süess, Martin J.; Niederberger, Markus

    2014-10-01

Texturing of magnetic ceramics and composites by aligning and fixing of colloidal particles in a magnetic field is a powerful strategy to induce anisotropic chemical, physical and especially mechanical properties into bulk materials. If porosity could be introduced, anisotropically structured magnetic materials would be the perfect supports for magnetic separations in biotechnology or for magnetic field-assisted chemical reactions. Aerogels, combining high porosity with nanoscale structural features, offer an exceptionally large surface area, but they are difficult to magnetically texture. Here we present the preparation of anatase-magnetite aerogel monoliths via the assembly of preformed nanocrystallites. Different approaches are proposed to produce macroscopic bodies with gradient-like magnetic segmentation or with strongly anisotropic magnetic texture.

  11. Monolithic cells for solar fuels.

    Science.gov (United States)

    Rongé, Jan; Bosserez, Tom; Martel, David; Nervi, Carlo; Boarino, Luca; Taulelle, Francis; Decher, Gero; Bordiga, Silvia; Martens, Johan A

    2014-12-07

    Hybrid energy generation models based on a variety of alternative energy supply technologies are considered the best way to cope with the depletion of fossil energy resources and to limit global warming. One of the currently missing technologies is the mimic of natural photosynthesis to convert carbon dioxide and water into chemical fuel using sunlight. This idea has been around for decades, but artificial photosynthesis of organic molecules is still far away from providing real-world solutions. The scientific challenge is to perform in an efficient way the multi-electron transfer reactions of water oxidation and carbon dioxide reduction using holes and single electrons generated in an illuminated semiconductor. In this tutorial review the design of photoelectrochemical (PEC) cells that combine solar water oxidation and CO2 reduction is discussed. In such PEC cells simultaneous transport and efficient use of light, electrons, protons and molecules has to be managed. It is explained how efficiency can be gained by compartmentalisation of the water oxidation and CO2 reduction processes by proton exchange membranes, and monolithic concepts of artificial leaves and solar membranes are presented. Besides transferring protons from the anode to the cathode compartment the membrane serves as a molecular barrier material to prevent cross-over of oxygen and fuel molecules. Innovative nano-organized multimaterials will be needed to realise practical artificial photosynthesis devices. This review provides an overview of synthesis techniques which could be used to realise monolithic multifunctional membrane-electrode assemblies, such as Layer-by-Layer (LbL) deposition, Atomic Layer Deposition (ALD), and porous silicon (porSi) engineering. Advances in modelling approaches, electrochemical techniques and in situ spectroscopies to characterise overall PEC cell performance are discussed.

  12. High-resolution real-time imaging processor for airborne SAR

    Science.gov (United States)

    Yu, Weidong; Wu, Shumei

    2003-04-01

Real-time imaging processors provide Synthetic Aperture Radar (SAR) images in real time, which is necessary for airborne SAR applications such as real-time monitoring and battle reconnaissance. This paper describes the development of a high-resolution real-time imaging processor at the Institute of Electronics, Chinese Academy of Sciences (IECAS). The processor uses multiple parallel channels to implement the large volume of calculation needed for SAR real-time imaging. A sub-aperture method is utilized to divide the azimuth Doppler spectrum into two parts, which correspond to two looks. The sub-aperture method yields high processing efficiency, a smaller range migration effect and reduced memory volume. The imaging swath is also divided into two segments, which are processed in parallel. The Range-Doppler algorithm, which consists of range migration correction and azimuth compression, is implemented in the processor. Careful software design ensures highly efficient utilization of the hardware. Simulations and field flights indicate the system is successful. The principles, architecture and hardware implementation of the processor are presented in this paper in detail.
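The sub-aperture split can be illustrated with a toy NumPy sketch: the azimuth spectrum is divided into two halves, each half is imaged separately, and the two looks are averaged. This only shows the spectral bookkeeping; the real processor additionally performs range migration correction and azimuth matched filtering, which are omitted here:

```python
import numpy as np

def two_look(signal):
    """Split the azimuth Doppler spectrum of `signal` into two sub-apertures
    (spectrum halves), image each independently, and average the intensities."""
    n = len(signal)
    spec = np.fft.fft(signal)
    looks = []
    for half in (slice(0, n // 2), slice(n // 2, n)):
        sub = np.zeros_like(spec)
        sub[half] = spec[half]            # keep only this sub-aperture's band
        looks.append(np.abs(np.fft.ifft(sub)) ** 2)
    return (looks[0] + looks[1]) / 2      # incoherent two-look average

x = np.exp(2j * np.pi * 0.1 * np.arange(128))  # toy azimuth phase history
img = two_look(x)
print(img.shape)  # (128,)
```

Averaging the two looks trades azimuth resolution for speckle reduction, and each half-spectrum can be processed on its own channel, which is the parallelism the record describes.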

  13. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

In this thesis, we present the TM3270 VLIW media-processor, the latest of TriMedia processors, and describe the innovations with respect to its predecessor: the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural

  14. Multi-output programmable quantum processor

    OpenAIRE

    Yu, Yafei; Feng, Jian; Zhan, Mingsheng

    2002-01-01

By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79:321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately-transformed versions of the input data. The processor succeeds with a certain probability.

  15. 7 CFR 1215.14 - Processor.

    Science.gov (United States)

    2010-01-01

    ... AND ORDERS; MISCELLANEOUS COMMODITIES), DEPARTMENT OF AGRICULTURE POPCORN PROMOTION, RESEARCH, AND CONSUMER INFORMATION Popcorn Promotion, Research, and Consumer Information Order Definitions § 1215.14 Processor. Processor means a person engaged in the preparation of unpopped popcorn for the market who...

  16. The TM3270 Media-processor

    NARCIS (Netherlands)

    van de Waerdt, J.W.

    2006-01-01

In this thesis, we present the TM3270 VLIW media-processor, the latest of TriMedia processors, and describe the innovations with respect to its predecessor: the TM3260. We describe enhancements to the load/store unit design, such as a new data prefetching technique, and architectural enhancements

  17. Advanced Multiple Processor Configuration Study. Final Report.

    Science.gov (United States)

    Clymer, S. J.

    This summary of a study on multiple processor configurations includes the objectives, background, approach, and results of research undertaken to provide the Air Force with a generalized model of computer processor combinations for use in the evaluation of proposed flight training simulator computational designs. An analysis of a real-time flight…

  18. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably se

  19. An Empirical Evaluation of XQuery Processors

    NARCIS (Netherlands)

    Manegold, S.

    2008-01-01

    This paper presents an extensive and detailed experimental evaluation of XQuery processors. The study consists of running five publicly available XQuery benchmarks --- the Michigan benchmark (MBench), XBench, XMach-1, XMark and X007 --- on six XQuery processors, three stand-alone (file-based) XQuery

  20. The Case for a Generic Implant Processor

    NARCIS (Netherlands)

    Strydis, C.; Gaydadjiev, G.N.

    2008-01-01

    A more structured and streamlined design of implants is nowadays possible. In this paper we focus on implant processors located in the heart of implantable systems. We present a real and representative biomedical-application scenario where such a new processor can be employed. Based on a suitably

  1. Towards a Process Algebra for Shared Processors

    DEFF Research Database (Denmark)

    Buchholtz, Mikael; Andersen, Jacob; Løvengreen, Hans Henrik

    2002-01-01

    We present initial work on a timed process algebra that models sharing of processor resources allowing preemption at arbitrary points in time. This enables us to model both the functional and the timely behaviour of concurrent processes executed on a single processor. We give a refinement relation...

  2. Verilog Implementation of 32-Bit CISC Processor

    Directory of Open Access Journals (Sweden)

    P.Kanaka Sirisha

    2016-04-01

Full Text Available The Project deals with the design of a 32-Bit CISC Processor and the modeling of its components using the Verilog language. The entire Processor uses a 32-Bit bus to deal with all the registers and the memories. This Processor implements various arithmetic, logical and data transfer operations using variable length instructions, which is the core property of the CISC Architecture. The Processor also supports various addressing modes to perform a 32-Bit instruction. Our Processor uses the Harvard Architecture (i.e., separate program and data memories) and hence has different buses to negotiate with the Program Memory and Data Memory individually. This feature enhances the speed of our processor. Accordingly, it has two different Program Counters to point to memory locations in the Program Memory and Data Memory. Our processor has ‘Instruction Queuing’, which enables it to save the time needed to fetch an instruction and hence increases the speed of operation. An ‘Interrupt Service Routine’ is provided in our Processor to service interrupts.
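The Harvard arrangement described above can be modelled conceptually. The project itself is written in Verilog; this Python toy (all names invented for illustration) only shows the essential point, that instructions and data live in separate memories addressed by independent counters:

```python
class HarvardCPU:
    """Toy model of a Harvard-architecture fetch path: instructions and
    data sit in separate memories with independent address counters."""
    def __init__(self, program, data):
        self.imem = program   # instruction memory (own bus in hardware)
        self.dmem = data      # data memory (separate bus in hardware)
        self.ipc = 0          # program counter into imem
        self.dpc = 0          # independent counter into dmem

    def step(self):
        op, *args = self.imem[self.ipc]
        self.ipc += 1
        if op == "LOAD":      # data fetch proceeds on its own bus/counter
            value = self.dmem[self.dpc]
            self.dpc += 1
            return value
        if op == "ADDI":
            return args[0] + args[1]

cpu = HarvardCPU([("LOAD",), ("ADDI", 2, 3)], [42])
print(cpu.step())  # 42
print(cpu.step())  # 5
```

In hardware the two buses let an instruction fetch and a data access happen in the same cycle, which is the speed advantage the record claims.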

  3. DFT algorithms for bit-serial GaAs array processor architectures

    Science.gov (United States)

    Mcmillan, Gary B.

    1988-01-01

    Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
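Bit-serial arithmetic, as used in the Floating Point Coprocessor described above, streams one bit per clock through a single full adder with a carry latch, trading latency for very low gate count. A Python model of such a serial adder (LSB first; the function name is this sketch's invention) might look like:

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers presented one bit per clock, LSB first, the way a
    bit-serial unit would: one full adder plus a carry flip-flop."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        s = a ^ b ^ carry                       # full-adder sum bit
        carry = (a & b) | (carry & (a ^ b))     # carry stored for next clock
        out.append(s)
    out.append(carry)                           # final carry-out bit
    return out

# 6 (LSB-first 0,1,1) + 3 (LSB-first 1,1,0) = 9 (LSB-first 1,0,0,1)
print(bit_serial_add([0, 1, 1], [1, 1, 0]))  # [1, 0, 0, 1]
```

Only one adder cell is needed regardless of word width, which is why bit-serial units suit the low integration levels of the GaAs technology discussed in the record.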

  4. Monolithic Time Delay Integrated APD Arrays Project

    Data.gov (United States)

    National Aeronautics and Space Administration — The overall goal of the proposed program by Epitaxial Technologies is to develop monolithic time delay integrated avalanche photodiode (APD) arrays with sensitivity...

  5. Parallelization of Subchannel Analysis Code MATRA

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Seongjin; Hwang, Daehyun; Kwon, Hyouk [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of)

    2014-05-15

A stand-alone calculation of the MATRA code takes appreciable computing time for thermal margin calculations, and considerably more time is needed to solve whole-core pin-by-pin problems. In addition, it is strongly required to improve the computation speed of the MATRA code to satisfy the overall performance of multi-physics coupling calculations. Therefore, a parallel approach to improve and optimize the computability of the MATRA code is proposed and verified in this study. The parallel algorithm is embodied in the MATRA code using the MPI communication method, and modification of the previous code structure was minimized. The improvement is confirmed by comparing the results between the single- and multiple-processor algorithms. The speedup and efficiency are also evaluated when increasing the number of processors. The parallel algorithm was implemented in the subchannel code MATRA using MPI. The performance of the parallel algorithm was verified by comparing the results with those from MATRA with a single processor. It is also noted that the performance of the MATRA code was greatly improved by implementing the parallel algorithm for the 1/8-core and whole-core problems.
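The speedup and efficiency metrics evaluated in the record are the standard strong-scaling quantities S = T1/Tp and E = S/p. A small worked example (the timings below are hypothetical, not measurements from MATRA):

```python
def speedup_efficiency(t_serial, t_parallel, n_procs):
    """Strong-scaling metrics for a parallelization:
    speedup S = T1/Tp and efficiency E = S/p."""
    s = t_serial / t_parallel
    return s, s / n_procs

# Hypothetical timings (seconds) for a whole-core pin-by-pin run.
s, e = speedup_efficiency(1200.0, 200.0, 8)
print(f"speedup {s:.1f}x, efficiency {e:.0%}")  # speedup 6.0x, efficiency 75%
```

Efficiency below 100% reflects communication overhead and load imbalance; tracking it while increasing the processor count is exactly the evaluation the record describes.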

  6. Equalizer: a scalable parallel rendering framework.

    Science.gov (United States)

    Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato

    2009-01-01

Continuing improvements in CPU and GPU performance as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.

  7. Asynchronous Parallelization of a CFD Solver

    Directory of Open Access Journals (Sweden)

    Daniel S. Abdi

    2015-01-01

Full Text Available A Navier-Stokes equations solver is parallelized to run on a cluster of computers using the domain decomposition method. Two approaches of communication and computation are investigated, namely, synchronous and asynchronous methods. Asynchronous communication between subdomains is not commonly used in CFD codes; however, it has a potential to alleviate scaling bottlenecks incurred due to processors having to wait for each other at designated synchronization points. A common way to avoid this idle time is to overlap asynchronous communication with computation. For this to work, however, there must be something useful and independent a processor can do while waiting for messages to arrive. We investigate an alternative approach of computation, namely, conducting asynchronous iterations to improve the local subdomain solution while communication is in progress. An in-house CFD code is parallelized using the message passing interface (MPI), and scalability tests are conducted that suggest asynchronous iterations are a viable way of parallelizing CFD code.
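The idea of asynchronous iterations can be mimicked serially: each subdomain keeps relaxing with a possibly stale copy of its neighbour's boundary value, refreshed only at simulated message arrivals. A toy 1D Laplace sketch (this stands in for the record's Navier-Stokes solver; all names and parameters are this example's assumptions):

```python
import numpy as np

def jacobi_async(n=32, iters=10000, exchange_every=5):
    """1D Laplace solve on two subdomains. Each subdomain iterates with a
    possibly stale neighbour boundary value ('halo') and refreshes it only
    when a simulated message arrives, i.e. every `exchange_every` sweeps."""
    u = np.zeros(n)
    u[0], u[-1] = 0.0, 1.0                    # fixed Dirichlet ends
    left, right = u[: n // 2].copy(), u[n // 2:].copy()
    halo_l, halo_r = right[0], left[-1]       # cached neighbour values
    for it in range(iters):
        if it % exchange_every == 0:          # message exchange point
            halo_l, halo_r = right[0], left[-1]
        left[1:-1] = 0.5 * (left[:-2] + left[2:])
        left[-1] = 0.5 * (left[-2] + halo_l)  # uses (possibly stale) halo
        right[1:-1] = 0.5 * (right[:-2] + right[2:])
        right[0] = 0.5 * (halo_r + right[1])
    return np.concatenate([left, right])

u = jacobi_async()
# The converged solution is the linear profile u_i = i/(n-1).
print(abs(u[16] - 16 / 31) < 0.02)  # True
```

The stale halos slow convergence slightly but the iteration still reaches the same fixed point, which is the property that makes overlapping iteration with in-flight communication attractive.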

  8. Parallelization of hydrocodes on the Intel hypercube

    Energy Technology Data Exchange (ETDEWEB)

    Hicks, D.L.; Liebrock, L.M.; Mousseau, V.A.; Mortensen, G.A.

    1987-04-01

This report describes the Intel hydrocode parallelization project. The project consists of running four solution methods on two test problems. The code used is a modified version of one written for a project on the Heterogeneous Element Processor (HEP). In the course of running these test problems, timings were done to determine the optimal number of processors to use in solving each problem. The final conclusion drawn from this experiment is that while the hypercube architecture is appropriate for this class of problems, the high communication overhead combined with too little memory makes the Intel hypercube a poor implementation target.

  9. Parallel Computing Methods For Particle Accelerator Design

    CERN Document Server

    Popescu, Diana Andreea; Hersch, Roger

    We present methods for parallelizing the transport map construction for multi-core processors and for Graphics Processing Units (GPUs). We provide an efficient implementation of the transport map construction. We describe a method for multi-core processors using the OpenMP framework which brings performance improvement over the serial version of the map construction. We developed a novel and efficient algorithm for multivariate polynomial multiplication for GPUs and we implemented it using the CUDA framework. We show the benefits of using the multivariate polynomial multiplication algorithm for GPUs in the map composition operation for high orders. Finally, we present an algorithm for map composition for GPUs.

  10. Implementing High Performance Lexical Analyzer using CELL Broadband Engine Processor

    Directory of Open Access Journals (Sweden)

    P.J.SATHISH KUMAR

    2011-09-01

Full Text Available The lexical analyzer is the first phase of the compiler and commonly the most time consuming. The compilation of large programs is still far from optimized in today’s compilers. With modern processors moving more towards improving parallelization and multithreading, older compilers can no longer achieve performance gains as technology advances. Any multicore architecture relies on improving parallelism rather than improving single-core performance. A compiler that is completely parallel and optimized is yet to be developed and would require significant effort to create. On careful analysis we find that the performance of a compiler is majorly affected by the lexical analyzer’s scanning and tokenizing phases. This effort is directed towards the creation of a completely parallelized lexical analyzer designed to run on the Cell/B.E. processor that utilizes its multicore functionality to achieve high performance gains in a compiler. Each SPE reads a block of data from the input and tokenizes it independently. To prevent dependences between SPEs, a scheme for dynamically extending static block-limits is incorporated. Each SPE is given a range which it initially scans and then finalizes its input buffer to a set of complete tokens from the range dynamically. This ensures that the SPEs parallelize independently and dynamically, with the PPE scheduling the load for each SPE. The initially static assignment of the code blocks is made dynamic as soon as one SPE commits. This aids SPE load distribution and balancing. The PPE maintains the output buffer until all SPEs of a single stage commit and move to the next stage before it is written out to the file, to maintain order of execution. The approach can be extended easily to other multicore architectures as well. Tokenization is performed by high-speed string searching against the keyword dictionary of the language, using the Aho-Corasick algorithm.
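The dynamic block-limit extension can be sketched as follows: static limits are pushed forward to the next token boundary so that no worker receives a token cut in half. This toy uses whitespace-delimited tokens and assumes tokens are short relative to the block size; the record's scanner tokenizes with a keyword dictionary and Aho-Corasick matching, which is omitted here:

```python
def block_limits(text, n_workers):
    """Split `text` into n_workers blocks for parallel scanning, then extend
    each static limit forward to the next whitespace so no token is split."""
    size = len(text) // n_workers
    limits = [i * size for i in range(1, n_workers)] + [len(text)]
    fixed = []
    start = 0
    for end in limits:
        while end < len(text) and not text[end - 1].isspace():
            end += 1          # push the boundary past the current token
        fixed.append((start, min(end, len(text))))
        start = fixed[-1][1]
    return fixed

src = "int main ( ) { return 0 ; }"
blocks = block_limits(src, 3)
print([src[a:b] for a, b in blocks])  # ['int main ', '( ) { return ', '0 ; }']
```

Each (start, end) pair would be handed to one SPE; because every block ends on a token boundary, the workers can tokenize without consulting each other, matching the independence the record describes.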

  11. Monolithic multinozzle emitters for nanoelectrospray mass spectrometry

    Science.gov (United States)

    Wang, Daojing; Yang, Peidong; Kim, Woong; Fan, Rong

    2011-09-20

Novel and significantly simplified procedures for fabrication of fully integrated nanoelectrospray emitters have been described. For nanofabricated monolithic multinozzle emitters (NM² emitters), a bottom-up approach using silicon nanowires on a silicon sliver is used. For microfabricated monolithic multinozzle emitters (M³ emitters), a top-down approach using MEMS techniques on silicon wafers is used. The emitters have performance comparable to that of commercially available silica capillary emitters for nanoelectrospray mass spectrometry.

  12. Parallel and Streaming Generation of Ghost Data for Structured Grids

    Energy Technology Data Exchange (ETDEWEB)

    Isenburg, M; Lindstrom, P; Childs, H

    2008-04-15

    Parallel simulations decompose large domains into many blocks. A fundamental requirement for subsequent parallel analysis and visualization is the presence of ghost data that supplements each block with a layer of adjacent data elements from neighboring blocks. The standard approach for generating ghost data requires all blocks to be in memory at once. This becomes impractical when there are fewer processors - and thus less aggregate memory - available for analysis than for simulation. We describe an algorithm for generating ghost data for structured grids that uses many fewer processors than previously possible. Our algorithm stores as little as one block per processor in memory and can run on as few processors as are available (possibly just one). The key idea is to slightly change the size of the original blocks by declaring parts of them to be ghost data, and by later padding adjacent blocks with this data.
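
The key idea, shrinking each block's owned region by declaring its edge cells ghost data and padding each block with its neighbours' edge cells, can be illustrated in one dimension. Below is a minimal Python sketch; unlike the streaming algorithm described above it keeps all blocks in memory, so it only shows the padding step:

```python
def add_ghost_layers(blocks, width=1):
    """Pad each 1-D block with `width` cells from each adjacent
    neighbour; those borrowed cells are the ghost data that parallel
    analysis needs to evaluate stencils at block boundaries."""
    padded = []
    for i, b in enumerate(blocks):
        left  = blocks[i - 1][-width:] if i > 0 else []
        right = blocks[i + 1][:width]  if i < len(blocks) - 1 else []
        padded.append(left + b + right)
    return padded

grid = list(range(12))
blocks = [grid[0:4], grid[4:8], grid[8:12]]
g = add_ghost_layers(blocks)
assert g[1] == [3, 4, 5, 6, 7, 8]   # one ghost cell from each neighbour
assert g[0] == [0, 1, 2, 3, 4]      # boundary block: ghost on one side only
```

In the paper's streaming setting, only adjacent blocks need to be resident at the same time, which is what allows the algorithm to run with as little as one block per processor.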

  13. Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

    CERN Document Server

    Alipour, Mehdi

    2012-01-01

    With the increasing complexity of network applications and internet traffic, network processors, a subset of embedded processors, have to process ever more computation-intensive tasks. With the scaling down of feature sizes and the emergence of chip multiprocessors (CMPs), which are usually multi-threaded processors, the performance requirements are largely met. As multithread processors are the heirs of uni-thread processors, and there is no general design flow for designing a multithread embedded processor, in this paper we perform a comprehensive design-space exploration for an optimum uni-thread embedded processor under limited area and power budgets. Finally, we run multiple threads on this architecture to find the maximum thread-level parallelism (TLP) achievable on the performance-per-power and area-optimum uni-thread architecture.

  14. Automatic Parallelization Tool: Classification of Program Code for Parallel Computing

    Directory of Open Access Journals (Sweden)

    Mustafa Basthikodi

    2016-04-01

    Full Text Available Performance growth of single-core processors came to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units have broadly enhanced parallelism. Several compilers have been updated to address the resulting synchronization and threading challenges. Appropriate program and algorithm classification greatly benefits software engineers in finding opportunities for effective parallelization. In the present work we investigate current species for the classification of algorithms; related work on classification is discussed along with a comparison of the issues that challenge classification. A set of algorithms is chosen whose structures match different issues and perform a given task. We tested these algorithms using existing automatic species extraction tools along with the Bones compiler. We added functionalities to the existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user-defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by the original species of algorithms. We implemented the new theories in the tool, enabling automatic characterization of program code.

  15. Activated Carbon Fiber Monoliths as Supercapacitor Electrodes

    Directory of Open Access Journals (Sweden)

    Gelines Moreno-Fernandez

    2017-01-01

    Full Text Available Activated carbon fibers (ACF are interesting candidates for electrodes in electrochemical energy storage devices; however, one major drawback for practical application is their low density. In the present work, monoliths were synthesized from two different ACFs, reaching densities 3 times higher than the original ACFs’ apparent densities. The porosity of the monoliths was only slightly decreased with respect to the pristine ACFs, since the PVDC binder employed develops additional porosity upon carbonization. The ACF monoliths are essentially microporous and reach BET surface areas of up to 1838 m2 g−1. SEM analysis reveals that the ACFs are well embedded into the monolith structure and that their length was significantly reduced by the monolith preparation process. The carbonized monoliths were studied as supercapacitor electrodes in two- and three-electrode cells with 2 M H2SO4 as electrolyte. Maximum capacitances of around 200 F g−1 were reached. The results confirm that the capacitance of the bisulfate anions essentially originates from the double layer, while hydronium cations contribute a mixture of double-layer capacitance and pseudocapacitance.

  16. Monolithic integration of erbium-doped amplifiers with silicon-on-insulator waveguides.

    Science.gov (United States)

    Agazzi, Laura; Bradley, Jonathan D B; Dijkstra, Meindert; Ay, Feridun; Roelkens, Gunther; Baets, Roel; Wörhoff, Kerstin; Pollnau, Markus

    2010-12-20

    Monolithic integration of Al2O3:Er3+ amplifier technology with passive silicon-on-insulator waveguides is demonstrated. A signal enhancement of >7 dB at 1533 nm wavelength is obtained. The straightforward wafer-scale fabrication process, which includes reactive co-sputtering and subsequent reactive ion etching, allows for parallel integration of multiple amplifier and laser sections with silicon or other photonic circuits on a chip.

  17. Combined Scheduling and Mapping for Scalable Computing with Parallel Tasks

    Directory of Open Access Journals (Sweden)

    Jörg Dümmler

    2012-01-01

    Full Text Available Recent and future parallel clusters and supercomputers use symmetric multiprocessors (SMPs and multi-core processors as basic nodes, providing a huge amount of parallel resources. These systems often have hierarchically structured interconnection networks combining computing resources at different levels, starting with the interconnect within multi-core processors up to the interconnection network combining nodes of the cluster or supercomputer. The challenge for the programmer is that these computing resources should be utilized efficiently by exploiting the available degree of parallelism of the application program and by structuring the application in a way which is sensitive to the heterogeneous interconnect. In this article, we pursue a parallel programming method using parallel tasks to structure parallel implementations. A parallel task can be executed by multiple processors or cores and, for each activation of a parallel task, the actual number of executing cores can be adapted to the specific execution situation. In particular, we propose a new combined scheduling and mapping technique for parallel tasks with dependencies that takes the hierarchical structure of modern multi-core clusters into account. An experimental evaluation shows that the presented programming approach can lead to a significantly higher performance compared to standard data parallel implementations.
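
As a simplified illustration of scheduling parallel tasks with dependencies, the sketch below groups tasks into layers of mutually independent tasks that can run concurrently. The paper's combined scheduling and mapping technique additionally chooses a core count per task activation and respects the hierarchical interconnect, which this toy version ignores:

```python
def schedule_layers(deps):
    """Level-by-level list scheduling: each layer contains tasks whose
    predecessors have all finished, so tasks within a layer may execute
    as parallel tasks on disjoint groups of cores."""
    remaining = dict(deps)        # task -> set of predecessor tasks
    done, layers = set(), []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle")
        layers.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return layers

# hypothetical task graph: A -> B, A -> C, {B, C} -> D
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
assert schedule_layers(deps) == [["A"], ["B", "C"], ["D"]]
```

A mapping step would then decide, per layer, how many cores each ready task receives, which is where interconnect-aware placement enters in the article's approach.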

  18. Research in Parallel Algorithms and Software for Computational Aerosciences

    Science.gov (United States)

    Domel, Neal D.

    1996-01-01

    Phase 1 is complete for the development of a computational fluid dynamics (CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.

  19. An "artificial retina" processor for track reconstruction at the full LHC crossing rate

    Science.gov (United States)

    Abba, A.; Bedeschi, F.; Caponio, F.; Cenci, R.; Citterio, M.; Cusimano, A.; Fu, J.; Geraci, A.; Grizzuti, M.; Lusardi, N.; Marino, P.; Morello, M. J.; Neri, N.; Ninci, D.; Petruzzo, M.; Piucci, A.; Punzi, G.; Ristori, L.; Spinella, F.; Stracka, S.; Tonelli, D.; Walsh, J.

    2016-07-01

    We present the latest results of an R&D study for a specialized processor capable of reconstructing, in a silicon pixel detector, high-quality tracks from high-energy collision events at 40 MHz. The processor applies a highly parallel pattern-recognition algorithm inspired by the fast detection of edges in the mammalian visual cortex. After a detailed study of a real-detector application, demonstrating that online reconstruction of offline-quality tracks is feasible at 40 MHz with sub-microsecond latency, we are implementing a prototype using common high-bandwidth FPGA devices.

  20. An “artificial retina” processor for track reconstruction at the full LHC crossing rate

    Energy Technology Data Exchange (ETDEWEB)

    Abba, A. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Politecnico di Milano, Milano (Italy); Bedeschi, F. [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); Caponio, F. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Politecnico di Milano, Milano (Italy); Cenci, R., E-mail: riccardo.cenci@pi.infn.it [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); Scuola Normale Superiore, Pisa (Italy); Citterio, M. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Cusimano, A. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Politecnico di Milano, Milano (Italy); Fu, J. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Geraci, A.; Grizzuti, M.; Lusardi, N. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Politecnico di Milano, Milano (Italy); Marino, P.; Morello, M.J. [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); Scuola Normale Superiore, Pisa (Italy); Neri, N. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Ninci, D. [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); Università di Pisa, Pisa (Italy); Petruzzo, M. [Istituto Nazionale di Fisica Nucleare – Sez. di Milano, Milano (Italy); Università di Milano, Milano (Italy); Piucci, A.; Punzi, G. [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); Università di Pisa, Pisa (Italy); Ristori, L. [Fermi National Accelerator Laboratory, Batavia, IL (United States); Spinella, F. [Istituto Nazionale di Fisica Nucleare – Sez. di Pisa, Pisa (Italy); and others

    2016-07-11

    We present the latest results of an R&D study for a specialized processor capable of reconstructing, in a silicon pixel detector, high-quality tracks from high-energy collision events at 40 MHz. The processor applies a highly parallel pattern-recognition algorithm inspired by the fast detection of edges in the mammalian visual cortex. After a detailed study of a real-detector application, demonstrating that online reconstruction of offline-quality tracks is feasible at 40 MHz with sub-microsecond latency, we are implementing a prototype using common high-bandwidth FPGA devices.

  1. Massively parallel computational fluid dynamics calculations for aerodynamics and aerothermodynamics applications

    Energy Technology Data Exchange (ETDEWEB)

    Payne, J.L.; Hassan, B.

    1998-09-01

    Massively parallel computers have enabled the analyst to solve complicated flow fields (turbulent, chemically reacting) that were previously intractable. Calculations are presented using a massively parallel CFD code called SACCARA (Sandia Advanced Code for Compressible Aerothermodynamics Research and Analysis) currently under development at Sandia National Laboratories as part of the Department of Energy (DOE) Accelerated Strategic Computing Initiative (ASCI). Computations were made on a generic reentry vehicle in a hypersonic flowfield utilizing three different distributed parallel computers to assess the parallel efficiency of the code with increasing numbers of processors. The parallel efficiencies for the SACCARA code will be presented for cases using 1, 150, 100 and 500 processors. Computations were also made on a subsonic/transonic vehicle using both 236 and 521 processors on a grid containing approximately 14.7 million grid points. Ongoing and future plans to implement a parallel overset grid capability and couple SACCARA with other mechanics codes in a massively parallel environment are discussed.
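
Parallel efficiency, as reported in scaling studies like this one, is conventionally defined as E(p) = T1 / (p · Tp): the speedup S(p) = T1 / Tp divided by the processor count p. A minimal sketch (the timing numbers below are made up for illustration, not taken from the SACCARA results):

```python
def speedup(t1, tp):
    """S(p) = T1 / Tp: serial wall time over parallel wall time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """E(p) = S(p) / p = T1 / (p * Tp); 1.0 means ideal scaling."""
    return speedup(t1, tp) / p

# e.g. a hypothetical run: 1000 s serially, 2.5 s on 500 processors
assert speedup(1000.0, 2.5) == 400.0
assert efficiency(1000.0, 2.5, 500) == 0.8
```

Efficiencies below 1.0 reflect communication overhead and load imbalance, which is why such studies sweep the processor count.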

  2. Enabling Future Robotic Missions with Multicore Processors

    Science.gov (United States)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key architectural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  3. Two Parallel Swendsen-Wang Cluster Algorithms Using Message-Passing Paradigm

    CERN Document Server

    Lin, Shizeng

    2008-01-01

    In this article, we present two different parallel Swendsen-Wang cluster (SWC) algorithms using the message-passing interface (MPI). One is based on the Master-Slave Parallel Model (MSPM) and the other on the Data-Parallel Model (DPM). A speedup of 24 with 40 processors and of 16 with 37 processors is achieved with the DPM and the MSPM, respectively. The speedup of both algorithms at different temperatures and system sizes is carefully examined both experimentally and theoretically, and a comparison of their efficiency is made. In the last section, based on these two parallel SWC algorithms, two parallel probability-changing cluster (PCC) algorithms are proposed.

  4. Mathematical model partitioning and packing for parallel computer calculation

    Science.gov (United States)

    Arpasi, Dale J.; Milner, Edward J.

    1986-01-01

    This paper deals with the development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system. The identification of computational parallelism within the model equations is discussed. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. Next, an algorithm which packs the equations into a minimum number of processors is described. The results of applying the packing algorithm to a turboshaft engine model are presented.
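
The abstract does not specify the packing algorithm itself, so as an illustrative stand-in, the classic first-fit-decreasing bin-packing heuristic captures the flavor of packing equation workloads onto a minimum number of processors:

```python
def pack_equations(costs, capacity):
    """First-fit-decreasing heuristic (an assumption; the paper's own
    packing algorithm is not given in the abstract): assign each
    equation's compute cost to the first processor with room,
    opening a new processor only when none fits."""
    processors = []   # each entry: [remaining_capacity, [assigned costs]]
    for c in sorted(costs, reverse=True):
        for proc in processors:
            if proc[0] >= c:
                proc[0] -= c
                proc[1].append(c)
                break
        else:
            processors.append([capacity - c, [c]])
    return [p[1] for p in processors]

loads = [7, 5, 4, 3, 2, 2, 1]          # hypothetical per-equation costs
packed = pack_equations(loads, capacity=8)
assert len(packed) == 3                 # three processors suffice
assert all(sum(p) <= 8 for p in packed)
```

A real partitioner would also weigh the data dependencies identified in the parallelism analysis, since co-locating dependent equations reduces inter-processor communication.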

  5. A Parallel Algorithm for the Vehicle Routing Problem

    Energy Technology Data Exchange (ETDEWEB)

    Groer, Christopher S [ORNL; Golden, Bruce [University of Maryland; Edward, Wasil [American University

    2011-01-01

    The vehicle routing problem (VRP) is a difficult and well-studied combinatorial optimization problem. We develop a parallel algorithm for the VRP that combines a heuristic local search improvement procedure with integer programming. We run our parallel algorithm with as many as 129 processors and are able to quickly find high-quality solutions to standard benchmark problems. We assess the impact of parallelism by analyzing our procedure's performance under a number of different scenarios.

  6. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose...

  7. Processor arrays with asynchronous TDM optical buses

    Science.gov (United States)

    Li, Y.; Zheng, S. Q.

    1997-04-01

    We propose a pipelined asynchronous time division multiplexing optical bus. Such a bus can use one of two hardwired priority schemes, the linear priority scheme and the round-robin priority scheme. Our simulation results show that the performance of our proposed buses is significantly better than that of known pipelined synchronous time division multiplexing optical buses. We also propose a class of processor arrays connected by pipelined asynchronous time division multiplexing optical buses. We claim that our proposed processor arrays not only have better performance, but also better scalability, than existing processor arrays connected by pipelined synchronous time division multiplexing optical buses.
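
The two hardwired priority schemes can be sketched as arbitration functions: linear priority always favors the lowest-numbered requesting processor, while round-robin rotates the starting point so that no processor is starved. A minimal Python illustration (not the authors' bus model):

```python
def linear_grant(requests):
    """Linear priority: the lowest-indexed requesting processor wins."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None

def round_robin_grant(requests, last):
    """Round-robin priority: search from the slot after the last
    grantee, so every requesting processor is eventually served."""
    n = len(requests)
    for k in range(1, n + 1):
        i = (last + k) % n
        if requests[i]:
            return i
    return None

reqs = [True, False, True, True]
assert linear_grant(reqs) == 0            # processor 0 always wins first
assert round_robin_grant(reqs, last=0) == 2
assert round_robin_grant(reqs, last=2) == 3
```

Under heavy load, linear priority can starve high-index processors, which is the fairness motivation for the round-robin variant.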

  8. UAV Swarm Mission Planning Development Using Evolutionary Algorithms and Parallel Simulation - Part II

    Science.gov (United States)

    2008-05-01

    Reynolds’ Distributed Behavior Model [18]. Corner [5,6,20] ported the model from a single-processor Windows platform to a parallel Linux-based Beowulf ... Beowulf clusters consist of 1 to 16 processors and 1 to 1024 problems in intervals of powers of two. Each test is run 30 times for statistical analysis.

  9. Monolithically integrated Ge CMOS laser

    Science.gov (United States)

    Camacho-Aguilera, Rodolfo

    2014-02-01

    Ge-on-Si devices are explored for photonic integration. Through the development of better growth techniques, monolithic integration, laser design and prototypes, it was possible to probe Ge light emitters with emphasis on lasers. Preliminary work shows thermal photonic behavior capable of enhancing luminescence at high temperatures. Increased luminescence is observed up to 120°C from the L-band contribution. Higher temperatures show a contribution from the Δ-band. The increased thermal carrier contribution suggests high-temperature applications for Ge light emitters. A Ge electrically pumped laser was probed under 0.2% biaxial strain with an n-type doping concentration of ~4.5×10¹⁹ cm⁻³. Ge pnn lasers exhibit a gain of >1000 cm⁻¹ with 8 mW power output and a spectral range of over 200 nm, making Ge an ideal candidate for Si photonics. Large temperature fluctuations and process limitations constrain the present device. Theoretically, a gain of >4000 cm⁻¹ is possible with a threshold as low as 1 kA/cm². Improvements in Ge work

  10. Design concepts for a virtualizable embedded MPSoC architecture enabling virtualization in embedded multi-processor systems

    CERN Document Server

    Biedermann, Alexander

    2014-01-01

    Alexander Biedermann presents a generic hardware-based virtualization approach which may transform an array of off-the-shelf embedded processors into a multi-processor system with high execution dynamism. Based on this approach, he highlights concepts for the design of energy-aware systems, self-healing systems, and parallelized systems. For the latter, the author introduces the novel Agile Processing scheme, which enables a seamless transition between sequential and parallel execution schemes. The design of such virtualizable systems is further aided by introduction

  11. Parallel algorithms

    CERN Document Server

    Casanova, Henri; Robert, Yves

    2008-01-01

    "…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi

  12. Fast parallel event reconstruction

    CERN Document Server

    CERN. Geneva

    2010-01-01

    On-line processing of the large data volumes produced in modern HEP experiments requires using the maximum capabilities of modern and future many-core CPU and GPU architectures. One such powerful feature is the SIMD instruction set, which allows packing several data items in one register and operating on all of them at once, thus achieving more operations per clock cycle. Motivated by the idea of using the SIMD unit of modern processors, the KF-based track fit has been adapted for parallelism, including memory optimization, numerical analysis, vectorization with inline operator overloading, and optimization using SDKs. The speed of the algorithm has been increased by a factor of 120000, to 0.1 ms/track, running in parallel on 16 SPEs of a Cell Blade computer. Running on a Nehalem CPU with 8 cores it shows a processing speed of 52 ns/track using the Intel Threading Building Blocks. The same KF algorithm running on an Nvidia GTX 280 in the CUDA framework provi...

  13. Research and Design of Monolithic Decision Circuit for Optical Communication System

    Institute of Scientific and Technical Information of China (English)

    ZHANG Yaqi; ZHAO Jie

    1997-01-01

    In this paper, the cause of bit errors arising when data are decided in the optical receiver is analyzed. A monolithic D-flip-flop decision circuit is designed; it works effectively at 622 Mb/s. Moreover, a decision method based on parallel processing to improve the decision speed is presented, through which the parallel circuit can work up to 1 Gb/s using the same model. With this technique, higher-speed data can be decided using lower-speed devices.

  14. The science of computing - Parallel computation

    Science.gov (United States)

    Denning, P. J.

    1985-01-01

    Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concomitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.

  15. A Parallel 3D Spectral Difference Method for Solutions of Compressible Navier Stokes Equations on Deforming Grids and Simulations of Vortex Induced Vibration

    Science.gov (United States)

    DeJong, Andrew

    Numerical models of fluid-structure interaction have grown in importance due to increasing interest in environmental energy harvesting, airfoil-gust interactions, and bio-inspired formation flying. Powered by increasingly powerful parallel computers, such models seek to explain the fundamental physics behind the complex, unsteady fluid-structure phenomena. To this end, a high-fidelity computational model based on the high-order spectral difference method on 3D unstructured, dynamic meshes has been developed. The spectral difference method constructs continuous solution fields within each element with a Riemann solver to compute the inviscid fluxes at the element interfaces and an averaging mechanism to compute the viscous fluxes. This method has shown promise in the past as a highly accurate, yet sufficiently fast method for solving unsteady viscous compressible flows. The solver is monolithically coupled to the equations of motion of an elastically mounted 3-degree of freedom rigid bluff body undergoing flow-induced lift, drag, and torque. The mesh is deformed using 4 methods: an analytic function, Laplace equation, biharmonic equation, and a bi-elliptic equation with variable diffusivity. This single system of equations -- fluid and structure -- is advanced through time using a 5-stage, 4th-order Runge-Kutta scheme. Message Passing Interface is used to run the coupled system in parallel on up to 240 processors. The solver is validated against previously published numerical and experimental data for an elastically mounted cylinder. The effect of adding an upstream body and inducing wake galloping is observed.

  16. Mixed-mode implementation of PETSc for scalable linear algebra on multi-core processors

    CERN Document Server

    Weiland, Michele; Gorman, Gerard; Kramer, Stephan; Parsons, Mark; Southern, James

    2012-01-01

    With multi-core processors a ubiquitous building block of modern supercomputers, it is now past time to enable applications to embrace these developments in processor design. To achieve exascale performance, applications will need ways of exploiting the new levels of parallelism that are exposed in modern high-performance computers. A typical approach to this is to use shared-memory programming techniques to best exploit multi-core nodes along with inter-node message passing. In this paper, we describe the addition of OpenMP threaded functionality to the PETSc library. We highlight some issues that hinder good performance of threaded applications on modern processors and describe how to negate them. The OpenMP branch of PETSc was benchmarked using matrices extracted from the Fluidity CFD application, which uses the library as its linear solver engine. The overall performance of the mixed-mode implementation is shown to be superior to that of the pure-MPI version.
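
The mixed-mode pattern, shared-memory work sharing inside a node combined with message passing between nodes, can be sketched for the intra-node half with a chunked reduction. Below is a toy Python analogue (PETSc's actual OpenMP branch threads its C kernels; Python threads are used here only to show the index-range decomposition, not for real speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def dot_chunk(x, y, lo, hi):
    """Partial dot product over one contiguous index chunk."""
    return sum(x[i] * y[i] for i in range(lo, hi))

def threaded_dot(x, y, nthreads=4):
    """OpenMP-style static work sharing: one contiguous chunk per
    thread, followed by a reduction of the partial sums."""
    n = len(x)
    step = (n + nthreads - 1) // nthreads
    ranges = [(i, min(i + step, n)) for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = pool.map(lambda r: dot_chunk(x, y, *r), ranges)
    return sum(parts)

x = [1.0] * 100
y = [2.0] * 100
assert threaded_dot(x, y) == 200.0
```

In the full mixed-mode scheme, each MPI rank would compute such a threaded local dot product and the ranks would then combine their results with an all-reduce.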

  17. XOP: a second generation fast processor for on-line use in high energy physics experiments

    CERN Document Server

    Lingjaerde, Tor

    1981-01-01

    Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fast execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction set by microcoded instructions, whose various fields control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated, optimized modules under the control of a central instruction unit. Thus, arithmetic operations, address calculations, conditional checking, loop counts and next-instruction evaluation all overlap in time. Based upon the experience from ESOP, the architecture of a new processor, "XOP", is beginning to take shape, which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data fi...

  18. The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

    CERN Document Server

    Biagioni, Andrea; Lonardo, Alessandro; Paolucci, Pier Stanislao; Perra, Mersia; Rossetti, Davide; Sidore, Carlo; Simula, Francesco; Tosoratto, Laura; Vicini, Piero

    2012-01-01

    One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip communications between processors, recent multi-tile (i.e. multi-core) architectures face the challenge of an efficient on-chip interconnection network between processor tiles. In this paper, we present a configurable and scalable architecture, based on our Distributed Network Processor (DNP) IP Library, targeting systems ranging from single MPSoCs to massive HPC platforms. The DNP provides inter-tile services for both on-chip and off-chip communications with a uniform RDMA-style API, over a multi-dimensional direct network with a (possibly) hybrid topology.

  19. A High Performance Router Based on Network Processor and Ternary—CAM

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiheng; LI Zhenwu; BAI Yingcai

    2003-01-01

    Network processors are designed to forward IP packets at wire speed, with the advantage (over ASIC-based solutions) of being programmable. TCAM (ternary content-addressable memory) is a perfect hardware realization of fully parallel search. The ternary capability can be used to determine longest prefix matches. This paper addresses how to use the IXP (Internet eXchange Processor), a kind of network processor, and TCAM to design a high-performance router. We also present a table management algorithm to decrease the incremental update time of TCAM and improve its utilization. We then analyse the performance of our prototype and demonstrate that it can easily sustain wire speed for eight 100 Mbps and two 1000 Mbps Ethernet ports.
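
A TCAM compares an address against all stored prefixes in a single parallel step and returns the longest match. The sketch below emulates that behavior in software by scanning prefixes longest-first (hypothetical routes; in hardware the comparison is a one-cycle parallel operation, not a scan):

```python
def lookup_lpm(table, addr_bits):
    """Longest-prefix match as a TCAM performs it in one parallel
    step, emulated here by trying prefixes longest-first."""
    for plen, prefix, next_hop in sorted(table, reverse=True):
        if addr_bits.startswith(prefix):
            return next_hop
    return "default"

# (prefix length, prefix bits, next hop) -- hypothetical routes
table = [
    (8,  "00001010",         "port-A"),   # 10.0.0.0/8
    (16, "0000101000000001", "port-B"),   # 10.1.0.0/16
]
addr = "00001010" + "00000001" + "0" * 16   # 10.1.0.0
assert lookup_lpm(table, addr) == "port-B"  # more specific route wins
assert lookup_lpm(table, "00001010" + "1" * 24) == "port-A"
```

The ternary "don't care" bits in a TCAM entry correspond to the address bits beyond the stored prefix length, which is why prefixes of any length can coexist in one lookup.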

  20. Programming Massively Parallel Architectures using MARTE: a Case Study

    CERN Document Server

    Rodrigues, Wendell; Dekeyser, Jean-Luc

    2011-01-01

    Nowadays, several industrial applications are being ported to parallel architectures. These applications take advantage of the potential parallelism provided by multi-core processors. Many-core processors, especially GPUs (Graphics Processing Units), have led the race of floating-point performance since 2003. While the performance improvement of general-purpose microprocessors has slowed significantly, GPUs have continued to improve relentlessly. As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 times. However, as parallel programming requires a non-trivial distribution of tasks and data, developers find it hard to implement their applications effectively. Aiming to improve the use of many-core processors, this work presents a case study using UML and the MARTE profile to specify and generate OpenCL code for intensive signal processing applications. Benchmark results show the viability of using MDE approaches to generate G...

  1. A hybrid algorithm for parallel molecular dynamics simulations

    CERN Document Server

    Mangiardi, Chris M

    2016-01-01

    This article describes an algorithm for hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-ranged forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with AVX and AVX-2 processors as well as Xeon-Phi co-processors.

  2. A High-Performance Communication Service for Parallel Servo Computing

    Directory of Open Access Journals (Sweden)

    Cheng Xin

    2010-11-01

    Full Text Available The complexity of algorithms for servo control in multi-dimensional, ultra-precise stage applications has made multi-processor parallel computing technology necessary. Considering the specific communication requirements of parallel servo computing, we propose a communication service scheme based on the VME bus, which provides high-performance data transmission and precise synchronization-trigger support for the processors involved. The communication service is implemented on both the standard VME bus and a user-defined Internal Bus (IB), and can be redefined online. This paper introduces the parallel servo computing architecture and communication service, describes the structure and implementation details of each module in the service, and finally provides a data transmission model and analysis. Experimental results show that the communication service can provide high-speed data transmission with a sub-nanosecond error in transmission latency, and synchronous triggering with a nanosecond-level synchronization error. Moreover, the performance of the communication service is not affected by an increasing number of processors.

  3. A hybrid algorithm for parallel molecular dynamics simulations

    Science.gov (United States)

    Mangiardi, Chris M.; Meyer, R.

    2017-10-01

    This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.

  4. Photonics and Fiber Optics Processor Lab

    Data.gov (United States)

    Federal Laboratory Consortium — The Photonics and Fiber Optics Processor Lab develops, tests and evaluates high speed fiber optic network components as well as network protocols. In addition, this...

  5. Radiation Tolerant Software Defined Video Processor Project

    Data.gov (United States)

    National Aeronautics and Space Administration — MaXentric's is proposing a radiation tolerant Software Define Video Processor, codenamed SDVP, for the problem of advanced motion imaging in the space environment....

  6. Processor-Dependent Malware... and codes

    CERN Document Server

    Desnos, Anthony; Filiol, Eric

    2010-01-01

    Malware usually targets computers according to their operating system: thus we have Windows malware, Linux malware, and so on. In this paper, we consider a different approach and show on a technical basis how easily malware can recognize and target systems selectively, according to the onboard processor chip. This technology is very easy to build since it does not rely on deep analysis of the chip's logical gate architecture. Floating Point Arithmetic (FPA) looks promising for defining a set of tests to identify the processor or, more precisely, a subset of possible processors. We give results for different families of processors: AMD, Intel (Dual Core, Atom), Sparc, Digital Alpha, Cell, Atom... As a conclusion, we propose two open problems that are new, to the authors' knowledge.

  7. Parallel biocomputing

    Directory of Open Access Journals (Sweden)

    Witte John S

    2011-03-01

    Abstract Background With the advent of high throughput genomics and high-resolution imaging techniques, there is a growing necessity in biology and medicine for parallel computing, and with the low cost of computing, it is now cost-effective for even small labs or individuals to build their own personal computation cluster. Methods Here we briefly describe how to use commodity hardware to build a low-cost, high-performance compute cluster, and provide an in-depth example and sample code for parallel execution of R jobs using MOSIX, a mature extension of the Linux kernel for parallel computing. A similar process can be used with other cluster platform software. Results As a statistical genetics example, we use our cluster to run a simulated eQTL experiment. Because eQTL is computationally intensive, and is conceptually easy to parallelize, like many statistics/genetics applications, parallel execution with MOSIX gives a linear speedup in analysis time with little additional effort. Conclusions We have used MOSIX to run a wide variety of software programs in parallel with good results. The limitations and benefits of using MOSIX are discussed and compared to other platforms.

  8. A two-level parallel direct search implementation for arbitrarily sized objective functions

    Energy Technology Data Exchange (ETDEWEB)

    Hutchinson, S.A.; Shadid, N.; Moffat, H.K. [Sandia National Labs., Albuquerque, NM (United States)] [and others]

    1994-12-31

    In the past, many optimization schemes for massively parallel computers have attempted to achieve parallel efficiency using one of two methods. In the case of large and expensive objective function calculations, the optimization itself may be run in serial and the objective function calculations parallelized. In contrast, if the objective function calculations are relatively inexpensive and can be performed on a single processor, then the actual optimization routine itself may be parallelized. In this paper, a scheme based upon the Parallel Direct Search (PDS) technique is presented which allows the objective function calculations to be done on an arbitrarily large number (p{sub 2}) of processors. If, p, the number of processors available, is greater than or equal to 2p{sub 2} then the optimization may be parallelized as well. This allows for efficient use of computational resources since the objective function calculations can be performed on the number of processors that allow for peak parallel efficiency and then further speedup may be achieved by parallelizing the optimization. Results are presented for an optimization problem which involves the solution of a PDE using a finite-element algorithm as part of the objective function calculation. The optimum number of processors for the finite-element calculations is less than p/2. Thus, the PDS method is also parallelized. Performance comparisons are given for a nCUBE 2 implementation.
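
    The two-level structure described above, an outer direct-search loop whose trial points are evaluated concurrently while each objective evaluation is itself split across a group of workers, can be sketched as follows. This is a minimal illustration, not the PDS algorithm from the paper; the quadratic objective, the thread pools, and the group sizes are assumptions made for the example.

```python
# Two-level parallel sketch: the outer level evaluates all stencil points of a
# coordinate direct search concurrently; the inner level partitions one
# objective evaluation across a small worker group (standing in for the p2
# processors per evaluation in the abstract). Objective is illustrative.
from concurrent.futures import ThreadPoolExecutor

def objective(x, inner_workers=2):
    # Inner level: split the work of one evaluation across a worker group.
    terms = [(xi - i) ** 2 for i, xi in enumerate(x)]
    chunks = [terms[i::inner_workers] for i in range(inner_workers)]
    with ThreadPoolExecutor(inner_workers) as pool:
        return sum(pool.map(sum, chunks))

def direct_search(x0, step=1.0, tol=1e-3):
    x, fx = list(x0), objective(x0)
    while step > tol:
        # Outer level: all 2n stencil points are evaluated concurrently.
        trials = [x[:i] + [x[i] + s] + x[i + 1:]
                  for i in range(len(x)) for s in (step, -step)]
        with ThreadPoolExecutor() as pool:
            values = list(pool.map(objective, trials))
        best = min(range(len(trials)), key=values.__getitem__)
        if values[best] < fx:
            x, fx = trials[best], values[best]
        else:
            step /= 2          # no improvement: contract the stencil
    return x, fx

x, fx = direct_search([5.0, 5.0])
print([round(v, 2) for v in x])  # [0.0, 1.0], the minimizer of the toy objective
```

Choosing the inner group size for peak efficiency of the objective, then spending remaining processors on the outer level, mirrors the paper's p >= 2*p2 argument.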

  9. Critical review of programmable media processor architectures

    Science.gov (United States)

    Berg, Stefan G.; Sun, Weiyun; Kim, Donglok; Kim, Yongmin

    1998-12-01

    In the past several years, there has been a surge of new programmable media processors introduced to provide an alternative solution to ASICs and dedicated hardware circuitries in the multimedia PC and embedded consumer electronics markets. These processors attempt to combine the programmability of multimedia-enhanced general-purpose processors with the performance and low cost of dedicated hardware. We have reviewed five current multimedia architectures and evaluated their strengths and weaknesses.

  10. First Cluster Algorithm Special Purpose Processor

    Science.gov (United States)

    Talapov, A. L.; Andreichenko, V. B.; Dotsenko, V. S.; Shchur, L. N.

    We describe the architecture of a special-purpose processor built to realize in hardware the Wolff cluster algorithm, which is not hampered by critical slowing down. The processor simulates two-dimensional Ising-like spin systems. With minor changes the same very effective architecture, which can be defined as a Memory Machine, can be used to study phase transitions in a wide range of models in two or three dimensions.

  11. A New Echeloned Poisson Series Processor (EPSP)

    Science.gov (United States)

    Ivanova, Tamara

    2001-07-01

    A specialized Echeloned Poisson Series Processor (EPSP) is proposed. It is typical software for the implementation of analytical algorithms of celestial mechanics. EPSP is designed for manipulating long polynomial-trigonometric series with literal divisors. The coefficients of these echeloned series are rational or floating-point numbers. A Keplerian processor and an analytical generator of special celestial-mechanics functions based on the EPSP are also developed.

  12. Evaluating current processors performance and machines stability

    CERN Document Server

    Esposito, R; Tortone, G; Taurino, F M

    2003-01-01

    Accurately estimating the performance of currently available processors is becoming a key activity, particularly in the HENP environment, where high computing power is crucial. This document describes the methods and programs, open source or freeware, used to benchmark processors, memory and disk subsystems, and network connection architectures. These tools are also useful to stress-test new machines, before their acquisition or before their introduction into a production environment, where high uptimes are required.

  13. SPRINT: A new parallel framework for R

    Directory of Open Access Journals (Sweden)

    Scharinger Florian

    2008-12-01

    Abstract Background Microarray analysis allows the simultaneous measurement of thousands to millions of genes or sequences across tens to thousands of different samples. The analysis of the resulting data tests the limits of existing bioinformatics computing infrastructure. A solution to this issue is to use High Performance Computing (HPC) systems, which contain many processors and more memory than desktop computer systems. Many biostatisticians use R to process the data gleaned from microarray analysis, and there is even a dedicated group of packages, Bioconductor, for this purpose. However, to exploit HPC systems, R must be able to utilise the multiple processors available on these systems. There are existing modules that enable R to use multiple processors, but these are either difficult to use for the HPC novice or cannot be used to solve certain classes of problems. A method of exploiting HPC systems, using R, but without recourse to mastering parallel programming paradigms, is therefore necessary to analyse genomic data to its fullest. Results We have designed and built a prototype framework that allows the addition of parallelised functions to R to enable the easy exploitation of HPC systems. The Simple Parallel R INTerface (SPRINT) is a wrapper around such parallelised functions. Their use requires very little modification to existing sequential R scripts and no expertise in parallel computing. As an example we created a function that carries out the computation of a pairwise calculated correlation matrix. This performs well with SPRINT: when executed using SPRINT on an HPC resource of eight processors, this computation completes in less than a third of the time R takes on one processor. Conclusion SPRINT allows the biostatistician to concentrate on the research problems rather than the computation, while still allowing exploitation of HPC systems. It is easy to use and with further development will become more useful as more...
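
    SPRINT itself is an R framework; as a language-neutral sketch of the same idea, the following Python code partitions the rows of a pairwise correlation matrix across a worker pool, so the caller's code stays sequential-looking while the work is spread over processors. The function names and the toy data are illustrative, not SPRINT's API.

```python
# Parallel pairwise Pearson correlation: each worker computes one row of the
# symmetric correlation matrix. The row-per-task partitioning mirrors the kind
# of data split a framework like SPRINT hides from the user. Toy data only.
import math
from concurrent.futures import ThreadPoolExecutor

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def correlation_matrix(rows, workers=4):
    n = len(rows)
    def one_row(i):  # each task produces one full row of the matrix
        return [pearson(rows[i], rows[j]) for j in range(n)]
    with ThreadPoolExecutor(workers) as pool:
        return list(pool.map(one_row, range(n)))

data = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
corr = correlation_matrix(data)
print(round(corr[0][1], 3), round(corr[0][2], 3))  # 1.0 -1.0
```

Only the symmetric upper triangle actually needs computing; a production version would exploit that and use processes rather than threads for CPU-bound work.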

  14. PDDP: A data parallel programming model. Revision 1

    Energy Technology Data Exchange (ETDEWEB)

    Warren, K.H.

    1995-06-01

    PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared-memory style and generates code that is portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.
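
    The data-distribution model the PDDP abstract describes can be sketched as follows: a global array is block-distributed over simulated processors, each owner applies the same elementwise operation only to the block it owns, and the global name space is reassembled. This is a hedged illustration of the owner-computes idea; the helper names are not from PDDP.

```python
# Owner-computes sketch of HPF-style block distribution: each (simulated)
# processor holds one contiguous block of the global array and applies the
# FORALL-like operation only to its own elements. Helper names are illustrative.

def block_distribute(global_array, nprocs):
    """Split a global array into nearly equal contiguous blocks, HPF-style."""
    n = len(global_array)
    size, rem = divmod(n, nprocs)
    blocks, start = [], 0
    for p in range(nprocs):
        stop = start + size + (1 if p < rem else 0)  # first `rem` blocks get one extra
        blocks.append(global_array[start:stop])
        start = stop
    return blocks

def forall(blocks, fn):
    """Each owner applies fn to its own block only (no communication needed)."""
    return [[fn(x) for x in blk] for blk in blocks]

def gather(blocks):
    """Reassemble the global name space from the distributed blocks."""
    return [x for blk in blocks for x in blk]

a = list(range(10))
blocks = block_distribute(a, 4)                # block sizes 3, 3, 2, 2
result = gather(forall(blocks, lambda x: 2 * x))
print(result)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

An elementwise operation needs no communication under this distribution; stencil or reduction operations would require exchanging block boundaries, which is where a preprocessor like PDDP inserts communication primitives.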

  15. SMART AS A CRYPTOGRAPHIC PROCESSOR

    Directory of Open Access Journals (Sweden)

    Saroja Kanchi

    2016-05-01

    SMaRT is a 16-bit 2.5-address RISC-type single-cycle processor, which was recently designed and successfully mapped into an FPGA chip in our ECE department. In this paper, we use SMaRT to run the well-known encryption algorithm, the Data Encryption Standard. For information security purposes, encryption is a must in today's sophisticated and ever-increasing computer communications such as ATM machines and SIM cards. For comparison and evaluation purposes, we also map the same algorithm onto the HC12, a same-size but CISC-type off-the-shelf microcontroller. Our results show that, compared to the HC12, SMaRT code is only 14% longer in terms of the static number of instructions but about 10 times faster in terms of the number of clock cycles, and 7% smaller in terms of code size. Our results also show that 2.5-address instructions, a SMaRT selling point, amount to 45% of all R-type instructions, resulting in a significant improvement in the static number of instructions, and hence in code size as well as performance. Additionally, we see that the SMaRT short-branch range is sufficiently wide in 90% of cases in the SMaRT code. Our results also reveal that SMaRT's novel concept of locality of reference, using the MSBs of the registers in non-subroutine branch instructions, stays valid with a remarkable hit rate of 95%!

  16. Plasmonics and the parallel programming problem

    Science.gov (United States)

    Vishkin, Uzi; Smolyaninov, Igor; Davis, Chris

    2007-02-01

    While many parallel computers have been built, it has generally been too difficult to program them. Now, all computers are effectively becoming parallel machines. Biannual doubling in the number of cores on a single chip, or faster, over the coming decade is planned by most computer vendors. Thus, the parallel programming problem is becoming more critical. The only known solution to the parallel programming problem in the theory of computer science is through a parallel algorithmic theory called PRAM. Unfortunately, some of the PRAM theory assumptions regarding the bandwidth between processors and memories did not properly reflect a parallel computer that could be built in previous decades. Reaching memories, or other processors in a multi-processor organization, required off-chip connections through pins on the boundary of each electric chip. Using the number of transistors that is becoming available on chip, on-chip architectures that adequately support the PRAM are becoming possible. However, the bandwidth of off-chip connections remains insufficient and the latency remains too high. This creates a bottleneck at the boundary of the chip for a PRAM-On-Chip architecture. This also prevents scalability to larger "supercomputing" organizations spanning across many processing chips that can handle massive amounts of data. Instead of connections through pins and wires, power-efficient CMOS-compatible on-chip conversion to plasmonic nanowaveguides is introduced for improved latency and bandwidth. Proper incorporation of our ideas offers exciting avenues to resolving the parallel programming problem, and an alternative way for building faster, more usable and much more compact supercomputers.

  17. Parallel transposition of sparse data structures

    DEFF Research Database (Denmark)

    Wang, Hao; Liu, Weifeng; Hou, Kaixi

    2016-01-01

    Many applications in computational sciences and social sciences exploit sparsity and connectivity of acquired data. Even though many parallel sparse primitives such as sparse matrix-vector (SpMV) multiplication have been extensively studied, some other important building blocks, e.g., parallel...... transposition for sparse matrices and graphs, have not received the attention they deserve. In this paper, we first identify that the transposition operation can be a bottleneck of some fundamental sparse matrix and graph algorithms. Then, we revisit the performance and scalability of parallel transposition...... approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that our ScanTrans method achieves an average of 2.8-fold (up to 6.2-fold) speedup over the parallel...

  18. Genetic Parallel Programming: design and implementation.

    Science.gov (United States)

    Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

    2006-01-01

    This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.

  19. Modified monolithic silica capillary for preconcentration of catecholamines

    Institute of Scientific and Technical Information of China (English)

    Wei Chang; Tusyo-shi Komazu

    2009-01-01

    Preconcentration of catecholamines by the modified monolithic silica in the capillary was investigated in this study. In order to achieve a microchip-based method for determining catecholamines in the saliva, the monolithic silica was fabricated in the capillary and the monolithic silica was chemically modified by on-column reaction with phenylboronate. Different modified methods were compared. The concentration conditions were optimized. This study indicates the applicability of the modified monolithic silica capillary when it was used to concentrate catecholamines.

  1. Fracture resistance of monolithic zirconia molar crowns with reduced thickness

    OpenAIRE

    Nakamura, Keisuke; Harada, A.; Inagaki, R.; Kanno, Taro; Niwano, Y; Milleding, Percy; Ørtengren, Ulf Thore

    2015-01-01

    This is the accepted manuscript version. Published version is available at Acta Odontologica Scandinavica Objectives. The purpose of the present study was to analyze the relationship between fracture load of monolithic zirconia crowns and axial/occlusal thickness, and to evaluate the fracture resistance of monolithic zirconia crowns with reduced thickness in comparison with that of monolithic lithium disilicate crowns with regular thickness. Materials and methods. Monolithic zi...

  2. DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments

    CERN Document Server

    Deller, A T; Bailes, M; West, C

    2007-01-01

    We describe the development of an FX style correlator for Very Long Baseline Interferometry (VLBI), implemented in software and intended to run in multi-processor computing environments, such as large clusters of commodity machines (Beowulf clusters) or computers specifically designed for high performance computing, such as multi-processor shared-memory machines. We outline the scientific and practical benefits for VLBI correlation, these chiefly being due to the inherent flexibility of software and the fact that the highly parallel and scalable nature of the correlation task is well suited to a multi-processor computing environment. We suggest scientific applications where such an approach to VLBI correlation is most suited and will give the best returns. We report detailed results from the Distributed FX (DiFX) software correlator, running on the Swinburne supercomputer (a Beowulf cluster of approximately 300 commodity processors), including measures of the performance of the system. For example, to correla...

  3. Development of a monolithic ferrite memory array

    Science.gov (United States)

    Heckler, C. H., Jr.; Bhiwandker, N. C.

    1972-01-01

    The results of the development and testing of ferrite monolithic memory arrays are presented. This development required the synthesis of ferrite materials having special magnetic and physical characteristics and the development of special processes: (1) for making flexible sheets (laminae) of the ferrite composition, (2) for embedding conductors in ferrite, and (3) for bonding ferrite laminae together to form a monolithic structure. Major problems encountered in each of these areas and their solutions are discussed. Twenty-two full-size arrays were fabricated and fired during the development of these processes. The majority of these arrays were tested for their memory characteristics as well as for their physical characteristics, and the results are presented. The arrays produced during this program meet the essential goals and demonstrate the feasibility of fabricating monolithic ferrite memory arrays by the processes developed.

  4. Carbon Fiber Composite Monoliths as Catalyst Supports

    Energy Technology Data Exchange (ETDEWEB)

    Contescu, Cristian I [ORNL; Gallego, Nidia C [ORNL; Pickel, Joseph M [ORNL; Blom, Douglas Allen [ORNL; Burchell, Timothy D [ORNL

    2006-01-01

    Carbon fiber composite monoliths are rigid bodies that can be activated to a large surface area, have tunable porosity, and proven performance in gas separation and storage. They are ideal as catalyst supports in applications where a rigid support, with open structure and easy fluid access is desired. We developed a procedure for depositing a dispersed nanoparticulate phase of molybdenum carbide (Mo2C) on carbon composite monoliths in the concentration range of 3 to 15 wt% Mo. The composition and morphology of this phase was characterized using X-ray diffraction and electron microscopy, and a mechanism was suggested for its formation. Molybdenum carbide is known for its catalytic properties that resemble those of platinum group metals, but at a lower cost. The materials obtained are expected to demonstrate catalytic activity in a series of hydrocarbon reactions involving hydrogen transfer. This project demonstrates the potential of carbon fiber composite monoliths as catalyst supports.

  6. Eigenpolarization theory of monolithic nonplanar ring oscillators

    Science.gov (United States)

    Nilsson, Alan C.; Gustafson, Eric K.; Byer, Robert L.

    1989-01-01

    Diode-laser-pumped monolithic nonplanar ring oscillators (NPROs) in an applied magnetic field can operate as unidirectional traveling-wave lasers. The diode laser pumping, monolithic construction, and unidirectional oscillation lead to narrow linewidth radiation. Here, a comprehensive theory of the eigenpolarizations of a monolithic NPRO is presented. It is shown how the properties of the integral optical diode that forces unidirectional operation depend on the choice of the gain medium, the applied magnetic field, the output coupler, and the geometry of the nonplanar ring light path. Using optical equivalence theorems to gain insight into the polarization characteristics of the NPRO, a strategy for designing NPROs with low thresholds and large loss nonreciprocities is given. An analysis of the eigenpolarizations for one such NPRO is presented, alternative optimization approaches are considered, and the prospects for further reducing the linewidths of these lasers are briefly discussed.

  7. Design of a Hardware Pipelining Processor

    Institute of Scientific and Technical Information of China (English)

    刘文波; 潘雪增

    2011-01-01

    This paper proposes a hardware pipelining processor that accelerates single-threaded programs at code-block granularity. It uses compile-time techniques to parallelize the program and divide it into blocks; some blocks are scheduled onto the processor's idle units, pipelining code blocks within the core to improve performance. The processor requires compiler support, obtained by modifying the instruction-fetch structure and the register-allocation logic, but needs few modifications to the original processor. Results show that the spec2000 benchmark programs achieve a 40% performance improvement with this processor.

  8. Parallel execution of chemical software on EGEE Grid

    CERN Document Server

    Sterzel, Mariusz

    2008-01-01

    Constant interest among the chemical community in studying larger and larger molecules forces the parallelization of existing computational methods in chemistry and the development of new ones. These are the main reasons for frequent port updates and requests from the community for Grid ports of new packages to satisfy their computational demands. Unfortunately, some parallelization schemes used by chemical packages cannot be directly used in a Grid environment. Here we present a solution for the Gaussian package. The current state of development of Grid middleware allows easy parallel execution for software using any MPI flavour. Unfortunately, many chemical packages do not use MPI for parallelization and therefore need special treatment. Gaussian can be executed in parallel on SMP architectures or via Linda. These require the reservation of a certain number of processors/cores on a given WN and an equal number of processors/cores on each WN, respectively. The current implementation of EGEE middleware does not offer such f...

  9. Parallel plasma fluid turbulence calculations

    Energy Technology Data Exchange (ETDEWEB)

    Leboeuf, J.N.; Carreras, B.A.; Charlton, L.A.; Drake, J.B.; Lynch, V.E.; Newman, D.E.; Sidikman, K.L.; Spong, D.A.

    1994-12-31

    The study of plasma turbulence and transport is a complex problem of critical importance for fusion-relevant plasmas. To this day, the fluid treatment of plasma dynamics is the best approach to realistic physics at the high resolution required for certain experimentally relevant calculations. Core and edge turbulence in a magnetic fusion device have been modeled using state-of-the-art, nonlinear, three-dimensional, initial-value fluid and gyrofluid codes. Parallel implementation of these models on diverse platforms--vector parallel (National Energy Research Supercomputer Center's CRAY Y-MP C90), massively parallel (Intel Paragon XP/S 35), and serial parallel (clusters of high-performance workstations using the Parallel Virtual Machine protocol)--offers a variety of paths to high resolution and significant improvements in real-time efficiency, each with its own advantages. The largest and most efficient calculations have been performed at the 200 Mword memory limit on the C90 in dedicated mode, where an overlap of 12 to 13 out of a maximum of 16 processors has been achieved with a gyrofluid model of core fluctuations. The richness of the physics captured by these calculations is commensurate with the increased resolution and efficiency and is limited only by the ingenuity brought to the analysis of the massive amounts of data generated.

  10. VLSI Processor For Vector Quantization

    Science.gov (United States)

    Tawel, Raoul

    1995-01-01

    Pixel intensities in each kernel are compared simultaneously with all code vectors. This prototype high-performance, low-power, very-large-scale integrated (VLSI) circuit is designed to perform compression of image data by the vector-quantization method. It contains relatively simple analog computational cells operating on direct or buffered outputs of photodetectors grouped into blocks in an imaging array, yielding a vector-quantization code word for each such block in sequence. The scheme exploits the parallel-processing nature of the vector-quantization architecture, with a consequent increase in speed.
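
    The comparison the chip performs in parallel, each image block against every code vector, can be sketched in software as a nearest-code-vector search under squared Euclidean distance. The 2x2 block size, the codebook, and the tiny image below are illustrative assumptions, not details from the record.

```python
# Software sketch of vector quantization: tile an image into 2x2 kernels and
# emit, for each kernel, the index of the nearest code vector. The VLSI chip
# performs all code-vector comparisons for a block at once; here they are a
# simple loop. Codebook and image are toy values.

def split_blocks(image, bs=2):
    """Tile a 2-D image (list of rows) into flattened bs x bs kernels."""
    blocks = []
    for i in range(0, len(image), bs):
        for j in range(0, len(image[0]), bs):
            blocks.append([image[i + di][j + dj]
                           for di in range(bs) for dj in range(bs)])
    return blocks

def quantize(block, codebook):
    """Return the index of the closest code vector (the VQ code word)."""
    def dist(cv):
        return sum((p - c) ** 2 for p, c in zip(block, cv))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

codebook = [[0, 0, 0, 0],            # dark block
            [255, 255, 255, 255],    # bright block
            [0, 255, 0, 255]]        # vertical edge
image = [[250, 251, 2, 250],
         [252, 249, 1, 255]]
codes = [quantize(b, codebook) for b in split_blocks(image)]
print(codes)  # [1, 2]: a bright block, then an edge block
```

Compression comes from transmitting one small index per block instead of the block's pixels; the decoder simply looks the index up in the shared codebook.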

  11. Options for Parallelizing a Planning and Scheduling Algorithm

    Science.gov (United States)

    Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.

    2011-01-01

    Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions, limited processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.

  12. Parallel optical interconnects: implementation of optoelectronics in multiprocessor architectures.

    Science.gov (United States)

    Frietman, E E; van Nifterick, W; Dekker, L; Jongeling, T J

    1990-03-10

    Performance and efficiency of multiple-processor computers depend strongly on the network that interconnects the distinct collaborating processors. Constrained connectivity forces much of the potential computing speed to be used to compensate for the limitation in connections. The availability of multiple parallel I/O connections allows full unrestricted connectivity and is an essential prerequisite for an interprocessor network that is able to meet the ever-growing communication demands. This paper emphasizes the design, building and application of an electrooptic communication system (EOCS). The EOCS uses dedicated free-space multiple data distributors and integrated, optically writable input-buffer arrays with fully parallel access.

  13. Physical and chemical sensing using monolithic semiconductor optical transducers

    Science.gov (United States)

    Zappe, Hans P.; Hofstetter, Daniel; Maisenhoelder, Bernd; Moser, Michael; Riel, Peter; Kunz, Rino E.

    1997-09-01

    We present two monolithically integrated optical sensor systems based on semiconductor photonic integrated circuits. These compact, robust and highly functional transducers perform all necessary optical and electro-optical functions on-chip; extension to multi-sensor arrays is easily envisaged. A monolithic Michelson interferometer for high-resolution displacement measurement and a monolithic Mach-Zehnder interferometer for refractometry are discussed.

  14. Toward Performance Portability of the FV3 Weather Model on CPU, GPU and MIC Processors

    Science.gov (United States)

    Govett, Mark; Rosinski, James; Middlecoff, Jacques; Schramm, Julie; Stringer, Lynd; Yu, Yonggang; Harrop, Chris

    2017-04-01

    The U.S. National Weather Service has selected the FV3 (Finite Volume cubed) dynamical core to become part of its next global operational weather prediction model. While the NWS is preparing to run FV3 operationally in late 2017, NOAA's Earth System Research Laboratory is adapting the model to be capable of running on next-generation GPU and MIC processors. The FV3 model was designed in the 1990s, and while it has been extensively optimized for traditional CPU chips, some code refactoring has been required to expose the parallelism needed to run on fine-grain GPU processors. The code transformations must demonstrate bit-wise reproducible results with the original CPU code, and between CPU, GPU and MIC processors. We will describe the parallelization and performance achieved while attempting to maintain performance portability between CPU, GPU and MIC in a single Fortran source code. Performance results will be shown using NOAA's new Pascal-based fine-grain GPU system (800 GPUs), and for the Knights Landing processor on the National Science Foundation (NSF) Stampede-2 system.

  15. Parallel implementation, validation, and performance of MM5

    Energy Technology Data Exchange (ETDEWEB)

    Michalakes, J.; Canfield, T.; Nanjundiah, R.; Hammond, S. [Argonne National Lab., IL (United States); Grell, G. [National Oceanic and Atmospheric Administration, Boulder, CO (United States)

    1994-12-31

    We describe a parallel implementation of the nonhydrostatic version of the Penn State/NCAR Mesoscale Model, MM5, that includes nesting capabilities. This version of the model can run on many different massively parallel computers (including a cluster of workstations). The model has been implemented and run on the IBM SP and Intel multiprocessors using a columnwise decomposition that supports irregularly shaped allocations of the problem to processors. This strategy will facilitate dynamic load balancing for improved parallel efficiency and promotes a modular design that simplifies the nesting problem. All data communication for finite differencing, inter-domain exchange of data, and I/O is encapsulated within a parallel library, RSL. Hence, there are no sends or receives in the parallel model itself. The library is generalizable to other, similar finite-difference approximation codes. The code is validated by comparing the rate of growth in error between the sequential and parallel models with the error growth rate when the sequential model input is perturbed to simulate floating-point rounding error. A series of runs on increasing numbers of parallel processors demonstrates that the parallel implementation is efficient and scalable to large numbers of processors.
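
    The columnwise strategy can be illustrated with a small load-balancing sketch (a hypothetical helper, not RSL's API): whole vertical columns are assigned to processors, so all column physics stays local and only horizontal halo exchange needs communication.

```python
def assign_columns(nx, ny, nprocs, cost=lambda i, j: 1):
    """Greedy assignment of whole (i, j) grid columns to processors,
    balancing total per-column cost; irregular shapes are allowed."""
    loads = [0.0] * nprocs
    owner = {}
    # Visit columns in order of decreasing cost and give each to the
    # least-loaded processor (the classic LPT heuristic).
    cols = sorted(((i, j) for i in range(nx) for j in range(ny)),
                  key=lambda c: -cost(*c))
    for c in cols:
        p = loads.index(min(loads))
        owner[c] = p
        loads[p] += cost(*c)
    return owner, loads

owner, loads = assign_columns(8, 8, 4)
print(loads)  # uniform cost: 16 columns per processor
```

    With a nonuniform cost function (e.g. more physics in active columns) the same helper produces the irregularly shaped allocations the abstract mentions.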

  16. Increased thermal conductivity monolithic zeolite structures

    Science.gov (United States)

    Klett, James; Klett, Lynn; Kaufman, Jonathan

    2008-11-25

    A monolith comprises a zeolite, a thermally conductive carbon, and a binder. The zeolite is included in the form of beads, pellets, powders and mixtures thereof. The thermally conductive carbon can be carbon nano-fibers, diamond or graphite, which provide thermal conductivities from about 100 W/mK to more than 1,000 W/mK. A method of preparing a zeolite monolith is given, which includes the steps of mixing a zeolite dispersion in an aqueous colloidal silica binder with a dispersion of carbon nano-fibers in water, followed by dehydration and curing of the binder.

  17. Characterization of CIM monoliths as enzyme reactors.

    Science.gov (United States)

    Vodopivec, Martina; Podgornik, Ales; Berovic, Marin; Strancar, Ales

    2003-09-25

    The immobilization of the enzymes citrate lyase, malate dehydrogenase, isocitrate dehydrogenase and lactate dehydrogenase to CIM monolithic supports was performed. The long-term stability, reproducibility, and linear response range of the immobilized enzyme reactors were investigated along with the determination of the kinetic behavior of the enzymes immobilized on the CIM monoliths. The Michaelis-Menten constant K(m) and the turnover number k(3) of the immobilized enzymes were found to be flow-unaffected. Furthermore, the K(m) values of the soluble and immobilized enzyme were found to be comparable. Both facts indicate the absence of a diffusional limitation in immobilized CIM enzyme reactors.
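
    For reference, the Michaelis-Menten rate law behind the reported constants, for the standard scheme E + S <-> ES -> E + P (with k(1) and k(2) the binding and dissociation rates and k(3) the turnover number):

```latex
v = \frac{k_{3}\,[E]_{0}\,[S]}{K_{m} + [S]},
\qquad
K_{m} = \frac{k_{2} + k_{3}}{k_{1}}
```

    A diffusion-limited immobilized enzyme would show an apparent K(m) that varies with flow rate; the flow-independence reported above argues against such a limitation.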

  18. Monolithically integrated optoelectronic down-converter (MIOD)

    Science.gov (United States)

    Portnoi, Efrim L.; Venus, G. B.; Khazan, A. A.; Gorfinkel, Vera B.; Kompa, Guenter; Avrutin, Evgenii A.; Thayne, Iain G.; Barrow, David A.; Marsh, John H.

    1995-06-01

    Optoelectronic down-conversion of very high-frequency amplitude-modulated signals using a semiconductor laser simultaneously as a local oscillator and a mixer is proposed. Three possible constructions of a monolithically integrated down-converter are considered theoretically: a four-terminal semiconductor laser with dual pumping current/modal gain control, and both a passively mode-locked and a passively Q-switched semiconductor laser monolithically integrated with an electroabsorption or pumping current modulator. Experimental verification of the feasibility of the concept of down conversion in a laser diode is presented.

  19. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    Energy Technology Data Exchange (ETDEWEB)

    Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core
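
    Matched-filter correlation via FFT, as used in the sonar benchmark above, can be sketched as follows (a pure-Python radix-2 FFT for illustration; the EnLight hardware of course implements this very differently):

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

def ifft(x):
    """Inverse FFT via the conjugation trick."""
    n = len(x)
    return [v.conjugate() / n for v in fft([v.conjugate() for v in x])]

def matched_filter(signal, template):
    """Circular cross-correlation via FFT: IFFT(FFT(signal) * conj(FFT(template))).
    The peak index gives the best alignment of the template in the signal."""
    S, T = fft(signal), fft(template)
    return [c.real for c in ifft([s * t.conjugate() for s, t in zip(S, T)])]

# Template buried in a longer frame at offset 3.
template = [1.0, -1.0, 1.0, 1.0] + [0.0] * 4
signal = [0.0] * 3 + [1.0, -1.0, 1.0, 1.0] + [0.0]
corr = matched_filter(signal, template)
print(max(range(len(corr)), key=lambda i: corr[i]))  # correlation peak at lag 3
```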

  20. IDSP- INTERACTIVE DIGITAL SIGNAL PROCESSOR

    Science.gov (United States)

    Mish, W. H.

    1994-01-01

    The Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on the various algorithms commonly used for digital signal analysis work. The processing of a digital time series to extract information is usually achieved by the application of a number of fairly standard operations. However, it is often desirable to "experiment" with various operations and combinations of operations to explore their effect on the results. IDSP is designed to provide an interactive and easy-to-use system for this type of digital time series analysis. The IDSP operators can be applied in any sensible order (even recursively), and can be applied to single time series or to simultaneous time series. IDSP is being used extensively to process data obtained from scientific instruments onboard spacecraft. It is also an excellent teaching tool for demonstrating the application of time series operators to artificially-generated signals. IDSP currently includes over 43 standard operators. Processing operators provide for Fourier transformation operations, design and application of digital filters, and Eigenvalue analysis. Additional support operators provide for data editing, display of information, graphical output, and batch operation. User-developed operators can be easily interfaced with the system to provide for expansion and experimentation. Each operator application generates one or more output files from an input file. The processing of a file can involve many operators in a complex application. IDSP maintains historical information as an integral part of each file so that the user can display the operator history of the file at any time during an interactive analysis. IDSP is written in VAX FORTRAN 77 for interactive or batch execution and has been implemented on a DEC VAX-11/780 operating under VMS. The IDSP system generates graphics output for a variety of graphics systems. The program requires the use of Versaplot and Template plotting

  1. Cache-Conscious Data Cube Computation on a Modern Processor

    Institute of Scientific and Technical Information of China (English)

    Hua Luan; Xiao-Yong Du; Shan Wang

    2009-01-01

    Data cube computation is an important problem in the field of data warehousing and OLAP (online analytical processing). Although it has been studied extensively in the past, most of its algorithms are designed without considering CPU and cache behavior. In this paper, we first propose a cache-conscious cubing approach called CC-Cubing to efficiently compute data cubes on a modern processor. This method can enhance CPU and cache performance. It adopts an integrated depth-first and breadth-first partitioning order and partitions multiple dimensions simultaneously. The partitioning scheme improves the data spatial locality and increases the utilization of cache lines. Software prefetching techniques are then applied in the sorting phase to hide the expensive cache misses associated with data scans. In addition, a cache-aware method is used in CC-Cubing to switch the sort algorithm dynamically. Our performance study shows that CC-Cubing outperforms BUC, Star-Cubing and MM-Cubing in most cases. Then, in order to fully utilize an SMT (simultaneous multithreading) processor, we present a thread-based CC-Cubing-SMT method. This parallel method provides an improvement of up to 27% over the single-threaded CC-Cubing algorithm.

  2. Image coding using parallel implementations of the embedded zerotree wavelet algorithm

    Science.gov (United States)

    Creusere, Charles D.

    1996-03-01

    We explore here the implementation of Shapiro's embedded zerotree wavelet (EZW) image coding algorithms on an array of parallel processors. To this end, we first consider the problem of parallelizing the basic wavelet transform, discussing past work in this area and the compatibility of that work with the zerotree coding process. From this discussion, we present a parallel partitioning of the transform which is computationally efficient and which allows the wavelet coefficients to be coded with little or no additional inter-processor communication. The key to achieving low data dependence between the processors is to ensure that each processor contains only entire zerotrees of wavelet coefficients after the decomposition is complete. We next quantify the rate-distortion tradeoffs associated with different levels of parallelization for a few variations of the basic coding algorithm. Studying these results, we conclude that the quality of the coder decreases as the number of parallel processors used to implement it increases. Noting that the performance of the parallel algorithm might be unacceptably poor for large processor arrays, we also develop an alternate algorithm which always achieves the same rate-distortion performance as the original sequential EZW algorithm at the cost of higher complexity and reduced scalability.
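
    The key partitioning constraint, that each processor hold only whole zerotrees, can be sketched as follows (illustrative quadtree indexing only; the details of Shapiro's coder are omitted):

```python
def zerotree(root, levels):
    """Indices (level, row, col) of the zerotree rooted at a coarsest-band
    coefficient: each coefficient has four children at the next finer level."""
    tree, frontier = [(0, *root)], [root]
    for lev in range(1, levels):
        frontier = [(2 * r + dr, 2 * c + dc)
                    for (r, c) in frontier for dr in (0, 1) for dc in (0, 1)]
        tree += [(lev, r, c) for (r, c) in frontier]
    return tree

# Assign whole trees to processors round-robin: no coefficient of a tree
# ever lands on another processor, so coding needs no extra communication.
roots = [(r, c) for r in range(2) for c in range(2)]
assign = {root: p % 2 for p, root in enumerate(roots)}
print(len(zerotree((0, 0), 3)))  # 1 + 4 + 16 = 21 coefficients per tree
```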

  3. Special purpose parallel computer architecture for real-time control and simulation in robotic applications

    Science.gov (United States)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1993-01-01

    This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.

  4. PARALLEL STABILIZATION

    Institute of Scientific and Technical Information of China (English)

    J.L.LIONS

    1999-01-01

    A new algorithm for the stabilization of (possibly turbulent, chaotic) distributed systems, governed by linear or nonlinear systems of equations, is presented. The SPA (Stabilization Parallel Algorithm) is based on a systematic parallel decomposition of the problem (related to arbitrarily overlapping decompositions of domains) and on a penalty argument. SPA is presented here for the case of linear parabolic equations with distributed or boundary control. It extends to practically all linear and nonlinear evolution equations, as will be shown in several other publications.

  5. Fully Parallel MHD Stability Analysis Tool

    Science.gov (United States)

    Svidzinski, Vladimir; Galkin, Sergei; Kim, Jin-Soo; Liu, Yueqiang

    2015-11-01

    Progress on the full parallelization of the plasma stability code MARS will be reported. MARS calculates eigenmodes in 2D axisymmetric toroidal equilibria in MHD-kinetic plasma models. It is a powerful tool for studying MHD and MHD-kinetic instabilities and is widely used by the fusion community. The parallel version of MARS is intended for simulations on local parallel clusters. It will be an efficient tool for simulating MHD instabilities with low, intermediate and high toroidal mode numbers within both the fluid and kinetic plasma models already implemented in MARS. Parallelization of the code includes parallelization of the construction of the matrix for the eigenvalue problem and parallelization of the inverse-iteration algorithm implemented in MARS for the solution of the formulated eigenvalue problem. Construction of the matrix is parallelized by distributing the load among processors assigned to different magnetic surfaces. Parallelization of the solution of the eigenvalue problem is achieved by repeating the steps of the present MARS algorithm using parallel libraries and procedures. Results of the MARS parallelization and of the development of a new fixed-boundary equilibrium code adapted for MARS input will be reported. Work is supported by the U.S. DOE SBIR program.

  6. Parallel storage and retrieval of pixmap images

    OpenAIRE

    Hersch, R.D.

    1993-01-01

    To fulfill the requirement of rapid access to huge amounts of uncompressed pixmap image data, a parallel image server architecture is proposed, based on arrays of intelligent disk nodes, with each disk node composed of one processor and one disk. It is shown how images can be partitioned into extents and efficiently distributed among available intelligent disk nodes. The image server's performance is analyzed according to various parameters such as the number of cooperating disk nodes, the si...
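
    The extent-distribution idea can be sketched as follows (a hypothetical round-robin placement; the paper's actual striping policy may differ):

```python
def distribute_extents(width, height, ext, nodes):
    """Partition an image into ext x ext extents and spread them round-robin
    across disk nodes, so neighbouring extents live on different nodes and
    can be fetched in parallel."""
    placement = {}
    tiles = ((y, x) for y in range(0, height, ext) for x in range(0, width, ext))
    for idx, (y, x) in enumerate(tiles):
        placement[(y, x)] = idx % nodes
    return placement

p = distribute_extents(1024, 1024, 256, 8)
print(len(p), len(set(p.values())))  # 16 extents spread over all 8 nodes
```

    Reading any window of the image then touches many nodes at once, which is where the parallel speedup comes from.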

  7. A parallel, portable and versatile treecode

    Energy Technology Data Exchange (ETDEWEB)

    Warren, M.S. [Los Alamos National Lab., NM (United States); Salmon, J.K. [Australian National Univ., Canberra, ACT (Australia)]|[California Inst. of Technology, Pasadena, CA (United States)

    1994-10-01

    Portability and versatility are important characteristics of a computer program which is meant to be generally useful. We describe how we have developed a parallel N-body treecode to meet these goals. A variety of applications to which the code can be applied are mentioned. Performance of the program is also measured on several machines. A 512 processor Intel Paragon can solve for the forces on 10 million gravitationally interacting particles to 0.5% rms accuracy in 28.6 seconds.

  8. LOW COMPLEXITY CONSTRAINTS FOR ENERGY AND PERFORMANCE MANAGEMENT OF HETEROGENEOUS MULTICORE PROCESSORS USING DYNAMIC OPTIMIZATION

    Directory of Open Access Journals (Sweden)

    A. S. Radhamani

    2014-01-01

    Optimization in a multicore processor environment is significant in real-world dynamic applications, as it is crucial to find and track change effectively over time, which requires an optimization algorithm. In massively parallel multicore processor architectures, Constraint-based Bacterial Foraging Particle Swarm Optimization (CBFPSO) scheduling can be implemented effectively, like other population-based metaheuristics. In this study we discuss possible approaches to parallelizing CBFPSO in multicore systems; the different constraints it uses to exploit parallelism are explored and evaluated. Due to their ability to keep a good balance between convergence and maintenance in real-world applications, CBFPSOs have been attracting more and more attention in recent years among the various algorithms for parallel-architecture optimization. To tackle the challenges of parallel-architecture optimization, several strategies have been proposed to enhance the performance of Particle Swarm Optimization (PSO), and these have obtained success on various multicore parallel-architecture optimization problems. But there still exist some issues in multicore architectures that require careful analysis. In this study, a new CBFPSO scheduling algorithm for multicore architectures is proposed, which updates the velocity and position by two bacterial behaviours, i.e., reproduction and elimination-dispersal. The performance of CBFPSO is compared with simulation results for a GA, and the results show that the proposed algorithm performs well on almost all types of cores compared to the GA with respect to completion time and energy consumption.
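
    The baseline velocity and position update that CBFPSO modifies is the standard PSO step; a minimal sketch (the bacterial reproduction and elimination-dispersal behaviours, and the scheduling constraints, are omitted here):

```python
import random

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One standard PSO velocity/position update: inertia plus cognitive
    (personal-best) and social (global-best) pulls."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        r1, r2 = random.random(), random.random()
        velocities[i] = w * v + c1 * r1 * (pbest[i] - x) + c2 * r2 * (gbest - x)
        positions[i] = x + velocities[i]
    return positions, velocities

random.seed(0)  # reproducible toy run
pos, vel = [0.0, 2.0, 5.0], [0.0, 0.0, 0.0]
pos, vel = pso_step(pos, vel, pbest=[0.0, 2.0, 5.0], gbest=2.0)
print(pos)  # particles 0 and 2 move toward gbest = 2.0; particle 1 stays
```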

  9. A Parallel Interval Computation Model for Global Optimization with Automatic Load Balancing

    Institute of Scientific and Technical Information of China (English)

    Yong Wu; Arun Kumar

    2012-01-01

    In this paper, we propose a decentralized parallel computation model for global optimization using interval analysis. The model is adaptive to any number of processors, and the workload is automatically and evenly distributed among all processors by alternative message passing. The problems received by each processor are processed based on their local dominance properties, which avoids unnecessary interval evaluations. Further, the problem is treated as a whole at the beginning of computation, so that no initial decomposition scheme is required. Numerical experiments indicate that the model works well and is stable with different numbers of parallel processors, distributes the load evenly among the processors, and provides an impressive speedup, especially when the problem is time-consuming to solve.
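
    The local-dominance pruning at the heart of such interval methods can be sketched in one dimension (a serial toy with naive interval bounds for a hypothetical objective f(x) = x^2 - 2x; the paper's model distributes the work list across processors):

```python
def f_range(lo, hi):
    """Interval extension of f(x) = x**2 - 2*x on [lo, hi] (naive bounds)."""
    sq = sorted(v * v for v in (lo, hi))
    sq_lo = 0.0 if lo <= 0.0 <= hi else sq[0]
    return (sq_lo - 2 * hi, sq[1] - 2 * lo)

def interval_minimize(lo, hi, tol=1e-6):
    """Branch-and-bound: split boxes and discard any whose lower bound
    exceeds the best upper bound found so far (local dominance)."""
    best_ub = float("inf")
    work = [(lo, hi)]
    while work:
        a, b = work.pop()
        flo, fhi = f_range(a, b)
        best_ub = min(best_ub, fhi)       # fhi bounds the minimum from above
        if flo > best_ub or b - a < tol:  # dominated, or small enough
            continue
        m = 0.5 * (a + b)
        work += [(a, m), (m, b)]
    return best_ub

print(round(interval_minimize(-4.0, 4.0), 3))  # -1.0 (minimum of f at x = 1)
```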

  10. Real time continuous wavelet transform implementation on a DSP processor.

    Science.gov (United States)

    Patil, S; Abel, E W

    2009-01-01

    The continuous wavelet transform (CWT) is an effective tool when the emphasis is on the analysis of non-stationary signals and on localization and characterization of singularities in signals. We have used the B-spline based CWT, the Lipschitz Exponent (LE) and measures derived from it to detect and quantify the singularity characteristics of biomedical signals. In this article, a real-time implementation of a B-spline based CWT on a digital signal processor is presented, with the aim of providing quantitative information about the signal to a clinician as it is being recorded. A recursive algorithm implementation was shown to be too slow for real-time implementation so a parallel algorithm was considered. The use of a parallel algorithm involves redundancy in calculations at the boundary points. An optimization of numerical computation to remove redundancy in calculation was carried out. A formula has been derived to give an exact operation count for any integer scale m and any B-spline of order n (for the case where n is odd) to calculate the CWT for both the original and the optimized parallel methods. Experimental results show that the optimized method is 20-28% faster than the original method. As an example of applying this optimized method, a real-time implementation of the CWT with LE postprocessing has been achieved for an EMG Interference Pattern signal sampled at 50 kHz.
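
    The parallel (direct, non-recursive) form of the CWT at an integer scale can be sketched as a plain convolution, every output sample computed independently at the cost of redundant boundary work; a Mexican-hat wavelet stands in here for the paper's B-spline wavelet:

```python
import math

def cwt_scale(signal, m):
    """CWT coefficients at integer scale m by direct convolution with the
    dilated wavelet (toy Mexican hat), normalized by the scale."""
    half = 4 * m
    w = [(1 - (t / m) ** 2) * math.exp(-0.5 * (t / m) ** 2)
         for t in range(-half, half + 1)]
    out = []
    for n in range(len(signal)):          # each n is independent: parallel form
        acc = 0.0
        for k, wk in enumerate(w):
            idx = n + k - half
            if 0 <= idx < len(signal):    # zero padding at the boundaries
                acc += signal[idx] * wk
        out.append(acc / m)
    return out

sig = [0.0] * 32
sig[16] = 1.0  # impulse input: the response traces out the scaled wavelet
resp = cwt_scale(sig, 2)
print(max(range(len(resp)), key=lambda i: abs(resp[i])))  # peak at index 16
```

    The CWT response to a singularity across scales is what the Lipschitz-exponent analysis quantifies.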

  11. Comparison Of Hybrid Sorting Algorithms Implemented On Different Parallel Hardware Platforms

    Directory of Open Access Journals (Sweden)

    Dominik Zurek

    2013-01-01

    Sorting is a common problem in computer science. There are many well-known sorting algorithms created for sequential execution on a single processor. Recently, hardware platforms have made it possible to create wide parallel algorithms: standard processors consist of multiple cores, and there are hardware accelerators like the GPU. Graphics cards, with their parallel architecture, give new possibilities to speed up many algorithms. In this paper we describe the results of implementing a few different sorting algorithms on GPU cards and on multicore processors. A hybrid algorithm is then presented, which consists of parts executed on both platforms, standard CPU and GPU.
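
    The hybrid idea, parallel chunk sorts followed by a sequential merge, can be sketched as follows (threads stand in for the GPU stage):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def hybrid_sort(data, chunks=4):
    """Hybrid sort sketch: sort chunks in parallel (standing in for the GPU
    stage), then merge the sorted runs sequentially on the CPU."""
    n = max(1, len(data) // chunks)
    parts = [data[i:i + n] for i in range(0, len(data), n)]
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(sorted, parts))
    return list(heapq.merge(*runs))

print(hybrid_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0]))
```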

  12. Performance of a plasma fluid code on the Intel parallel computers

    Energy Technology Data Exchange (ETDEWEB)

    Lynch, V.E. (Martin Marietta Energy Systems, Inc., Oak Ridge, TN (United States)); Carreras, B.A.; Drake, J.B.; Leboeuf, J.N. (Oak Ridge National Lab., TN (United States)); Liewer, P. (Jet Propulsion Lab., Pasadena, CA (United States))

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel Sigma machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  13. Performance of a plasma fluid code on the Intel parallel computers

    Science.gov (United States)

    Lynch, V. E.; Carreras, B. A.; Drake, J. B.; Leboeuf, J. N.; Liewer, P.

    1992-01-01

    One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A parallel algorithm for plasma turbulence calculations was tested on the Intel iPSC/860 hypercube and the Touchstone Delta machine. Using the 128 processors of the Intel iPSC/860 hypercube, a factor of 5 improvement over a single-processor CRAY-2 is obtained. For the Touchstone Delta machine, the corresponding improvement factor is 16. For plasma edge turbulence calculations, an extrapolation of the present results to the Intel Sigma machine gives an improvement factor close to 64 over the single-processor CRAY-2.

  14. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Experiments at the LHC search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the instantaneous luminosity increase, increasingly complex and exclusive selections are necessary. We present results and performances of a new prototype of the Associative Memory (AM) system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the "combinatorial challenge", is beaten by the AM technology exploiting parallelism to the maximum level. The Associative Memory compares the event to pre-calculated "expectations" or "patterns" (pattern matching) all at once and looks for candidate tracks called "roads". The problem is solved by the time the data are loaded into the AM devices. We report ...
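
    A toy software analogue of the AM pattern matching (real AM chips compare an event against all stored patterns simultaneously; the layer count and hit encodings here are made up):

```python
def match_roads(event_hits, patterns):
    """Associative-memory-style matching sketch: an event is a set of coarse
    hit addresses per detector layer; a pattern fires (a 'road') when every
    one of its layer hits is present in the event."""
    return [i for i, pat in enumerate(patterns)
            if all(hit in event_hits[layer] for layer, hit in enumerate(pat))]

# Three layers, toy coarse hit addresses; two precomputed patterns.
patterns = [(4, 7, 2), (5, 7, 3)]
event = [{4, 9}, {7}, {2, 8}]
print(match_roads(event, patterns))  # only pattern 0 fires: [0]
```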

  15. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2013-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the luminosity increase, we need increasingly complex and exclusive selections. We present results and performances of a new prototype of the Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the “combinatorial challenge”, is beaten by the Associative Memory (AM) technology exploiting parallelism to the maximum level: it compares the event to pre-calculated “expectations” or “patterns” (pattern matching) all at once, looking for candidate tracks called “roads”. The problem is solved by the time the data are loaded into the AM devices. We report on the tests of the integrate...

  16. The Associative Memory system for the FTK processor at ATLAS

    CERN Document Server

    Cipriani, R; The ATLAS collaboration; Donati, S; Giannetti, P; Lanza, A; Luciano, P; Magalotti, D; Piendibene, M

    2014-01-01

    Modern experiments search for extremely rare processes hidden in much larger background levels. As the experiment complexity, the accelerator backgrounds and the luminosity increase, we need increasingly complex and exclusive selections. We present results and performances of a new prototype of the Associative Memory system, the core of the Fast Tracker processor (FTK). FTK is a real-time tracking device for the ATLAS experiment trigger upgrade. The AM system provides massive computing power to minimize the online execution time of complex tracking algorithms. The time-consuming pattern recognition problem, generally referred to as the “combinatorial challenge”, is beaten by the Associative Memory (AM) technology exploiting parallelism to the maximum level: it compares the event to pre-calculated “expectations” or “patterns” (pattern matching) all at once, looking for candidate tracks called “roads”. The problem is solved by the time the data are loaded into the AM devices. We report on the tests of the integrate...

  17. HIGH THROUGHPUT OF MAP PROCESSOR USING PIPELINE WINDOW DECODING

    Directory of Open Access Journals (Sweden)

    P. Nithya

    2012-11-01

    Turbo codes are among the most efficient error-correcting codes, approaching the Shannon limit. High throughput in a turbo decoder can be achieved by parallelizing several Soft-Input Soft-Output (SISO) units. In this way, multiple SISO decoders work on the same data frame at the same time and deliver soft output values that can be split into three terms: the soft channel input, the a priori input, and the extrinsic value. The extrinsic value is used for the next iteration. We present a high-throughput Max-Log-MAP processor that supports both single-binary (SB) and double-binary (DB) convolutional turbo codes. Decoding these codes, however, requires iterative processing with a high computation rate and latency; serial processing techniques are therefore used to achieve high throughput and reduce latency. Pipeline window (PW) decoding is introduced to support arbitrary frame sizes with high throughput.
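
    The three-term split of the SISO output is additive in the log-likelihood-ratio (LLR) domain, so the extrinsic term passed to the next iteration is obtained by subtraction (toy LLR values):

```python
def extrinsic(l_posterior, l_channel, l_apriori):
    """A SISO decoder's output LLR is the sum of the channel term, the
    a priori term, and the extrinsic term; only the extrinsic part is
    exchanged between the component decoders at each iteration."""
    return l_posterior - l_channel - l_apriori

print(round(extrinsic(2.4, 1.0, 0.5), 1))  # extrinsic LLR: 0.9
```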

  18. A parallel implementation of kriging with a trend

    Energy Technology Data Exchange (ETDEWEB)

    Gajraj, A.; Joubert, W.; Jones, J.

    1997-11-01

    This paper describes the parallelization of the GSLIB ktb3dm code. The code is parallelized using the message passing paradigm, Parallel Virtual Machine (PVM), under a Multiple Instructions, Multiple Data (MIMD) architecture. The code performance is analyzed using different grid sizes of 5x5x1, 50x50x1, 100x100x1 and 500x500x1 with 1, 2, 4, 8 and in some cases 16 processors on the Cray T3D supercomputer. The parallelization effort focused on the main kriging do loop. The results confirm that there is a substantial benefit to be derived in terms of CPU time savings (or execution speed) by using the parallel version of the code, especially when considering larger grids. Additionally, speed-up and scalability analyses show that actual speed-up is close to theoretical, while the code scales appropriately within the 1 to 16 processor range tested. The kriging of a quarter-million grid cell system fell from over 9 CPU minutes on one Cray T3D processor to about 1.25 CPU minutes on 16 processors on the same machine.
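    The reported timings translate directly into speedup and parallel efficiency. A back-of-envelope sketch using the figures quoted in the abstract (the ideal-speedup baseline of p processors is an assumption):

    ```python
    def speedup(t_serial, t_parallel):
        # Classic speedup: serial time over parallel time
        return t_serial / t_parallel

    def efficiency(t_serial, t_parallel, p):
        # Fraction of the ideal p-fold speedup actually achieved
        return speedup(t_serial, t_parallel) / p

    # Quarter-million-cell kriging on the Cray T3D: ~9 CPU min on 1 processor,
    # ~1.25 CPU min on 16 processors (numbers quoted in the abstract).
    s = speedup(9.0, 1.25)
    e = efficiency(9.0, 1.25, 16)
    print(f"speedup = {s:.1f}, efficiency = {e:.2f}")  # speedup = 7.2, efficiency = 0.45
    ```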

  19. Parallel Worlds

    DEFF Research Database (Denmark)

    Steno, Anne Mia

    2013-01-01

    as a symbol of something else, for instance as a way of handling uncertainty in difficult times, magical practice should also be seen as an emic concept. In this context, understanding the existence of two parallel universes, the profane and the magic, is important because the witches’ movements across...

  20. Multicentre evaluation of the Naída CI Q70 sound processor: feedback from cochlear implant users and professionals

    Directory of Open Access Journals (Sweden)

    Jeanette Martin

    2016-12-01

    Full Text Available The aim of this survey was to gather data from both implant recipients and professionals on the ease of use of the Naída CI Q70 (Naída CI sound processor from Advanced Bionics and on the usefulness of the new functions and features available. A secondary objective was to investigate fitting practices with the new processor. A comprehensive user satisfaction survey was conducted in a total of 186 subjects from 24 centres. In parallel, 23 professional questionnaires were collected from 11 centres. Overall, there was high satisfaction with the Naída CI processor from adults, children, experienced and new CI users as well as from professionals. The Naída CI processor was shown as being easy to use by all ages of recipients and by professionals. The majority of experienced CI users rated the Naída CI processor as being similar or better than their previous processor in all areas surveyed. The Naída CI was recommended by the professionals for fitting in all populations. Features like UltraZoom, ZoomControl and DuoPhone would not be fitted to very young children in contrast to adults. Positive ratings were obtained for ease of use, comfort and usefulness of the new functions and features of the Naída CI sound processor. Seventy-seven percent of the experienced CI users rated the new processor as being better than their previous sound processor from a general point of view. The survey also showed that fitting practices were influenced by the age of the user.

  1. Monolithic Integration of GaN-based LEDs

    Energy Technology Data Exchange (ETDEWEB)

    Ao, Jin-Ping, E-mail: jpao@ee.tokushima-u.ac.jp [Institute of Technology and Science, University of Tokushima 2-1 Minami-Josanjima, Tokushima 770-8506 (Japan)

    2011-02-01

    The technology of monolithically integrated GaN-based light-emitting diodes (LEDs) is reported. First, the technology details needed to realize monolithic integration are described, including circuit design for high-voltage and alternating-current (AC) operation and the technologies for device isolation. The performance of the fabricated monolithic LED arrays is then demonstrated. A monolithic series array with a total of 40 LEDs operated as expected under AC bias; the operating voltage of the array is 72 V with 20 LEDs connected in series. Some modified circuit designs for high-voltage operation and other monolithic LED arrays are finally reviewed.

  2. Monolithic blue LED series arrays for high-voltage AC operation

    Energy Technology Data Exchange (ETDEWEB)

    Ao, Jin-Ping [Satellite Venture Business Laboratory, University of Tokushima, Tokushima 770-8506 (Japan); Sato, Hisao; Mizobuchi, Takashi; Morioka, Kenji; Kawano, Shunsuke; Muramoto, Yoshihiko; Sato, Daisuke; Sakai, Shiro [Nitride Semiconductor Co. Ltd., Naruto, Tokushima 771-0360 (Japan); Lee, Young-Bae; Ohno, Yasuo [Department of Electrical and Electronic Engineering, University of Tokushima, Tokushima 770-8506 (Japan)

    2002-12-16

    The design and fabrication of monolithic blue LED series arrays that can be operated at high AC voltage are described. Several LEDs (3, 7, and 20) are connected in series, and series strings in parallel, to meet AC operation. The chip size of a single device is 150 μm x 120 μm and the total size is 1.1 mm x 1 mm for a 40 (20+20) LED array. Deep dry etching was performed for device isolation. Two-layer interconnection and air bridges are utilized to connect the devices in an array. The monolithic series arrays exhibit the expected operation under DC and AC bias. The output power and forward voltage are almost proportional to the number of LEDs connected in series. On-wafer measurement shows that the output power is 40 mW for the 40 (20+20) LED array under AC 72 V. (Abstract Copyright [2002], Wiley Periodicals, Inc.)
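    The near-linear scaling of forward voltage with the number of series-connected devices (72 V across 20 LEDs suggests roughly 3.6 V per device at that drive) can be checked with simple arithmetic; the per-LED figure below is inferred, not stated in the abstract:

    ```python
    def array_forward_voltage(v_per_led, n_series):
        # Forward voltages of a series string are approximately additive.
        return v_per_led * n_series

    v_led = 72.0 / 20  # ~3.6 V per device, inferred from the 20-LED string
    for n in (3, 7, 20):
        print(n, "LEDs in series ->", array_forward_voltage(v_led, n), "V")
    ```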

  3. Power estimation on functional level for programmable processors

    Directory of Open Access Journals (Sweden)

    M. Schneider

    2004-01-01

    the input parameters of the arithmetic functions, e.g. the achieved degree of parallelism or the kind and number of memory accesses, can be computed. This approach is demonstrated and evaluated using two modern digital signal processors and a variety of basic digital signal processing algorithms. The resulting estimation values for the inspected algorithms are compared to physically measured values. A maximum estimation error of 3% is achieved.

  4. Monolithic resonant optical reflector laser diodes

    Science.gov (United States)

    Hirata, T.; Suehiro, M.; Maeda, M.; Hihara, M.; Hosomatsu, H.

    1991-10-01

    The first monolithic resonant optical reflector laser diode that has a waveguide directional coupler and two DBR reflectors integrated by compositional disordering of quantum-well heterostructures is described. A linewidth of 440 kHz was obtained, and this value is expected to be greatly decreased by reducing the propagation loss in the integrated waveguide.

  5. Constant capacitance in nanopores of carbon monoliths.

    Science.gov (United States)

    García-Gómez, Alejandra; Moreno-Fernández, Gelines; Lobato, Belén; Centeno, Teresa A

    2015-06-28

    The results obtained for binder-free electrodes made of carbon monoliths with narrow micropore size distributions confirm that the specific capacitance in the electrolyte (C2H5)4NBF4/acetonitrile does not depend significantly on the micropore size and support the previously reported constant value of 0.094 ± 0.011 F m⁻².

  6. Structured building model reduction toward parallel simulation

    Energy Technology Data Exchange (ETDEWEB)

    Dobbs, Justin R. [Cornell University; Hencey, Brondon M. [Cornell University

    2013-08-26

    Building energy model reduction exchanges accuracy for improved simulation speed by reducing the number of dynamical equations. Parallel computing aims to improve simulation times without loss of accuracy but is poorly utilized by contemporary simulators and is inherently limited by inter-processor communication. This paper bridges these disparate techniques to implement efficient parallel building thermal simulation. We begin with a survey of three structured reduction approaches that compares their performance to a leading unstructured method. We then use structured model reduction to find thermal clusters in the building energy model and allocate processing resources. Experimental results demonstrate faster simulation and low error without any interprocessor communication.

  7. Primitive parallel operations for computational linear algebra

    Energy Technology Data Exchange (ETDEWEB)

    Panetta, J.

    1985-01-01

    This work is a small step in the direction of code portability over parallel and vector machines. The proposal consists of a style of programming and a set of parallel operators built over abstract data types. Objects and operators are directed to the Computational Linear Algebra area, although the principles of the proposal can be applied to any other area. A subset of the operators was implemented on a 64-processor, distributed memory MIMD machine, and the results are that computationally intensive operators achieve asymptotically optimal speed-ups, but data movement operators are inefficient, some even intrinsically sequential.
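    A global reduction is a typical example of such a primitive parallel operator. A minimal sketch simulating log-depth pairwise combining across processor-local partial results (illustrative only; the simulated-processor framing and names are not from the paper):

    ```python
    def tree_reduce(values, op):
        """Simulate a binary combining tree over per-processor partial results:
        ceil(log2 p) combining rounds instead of p - 1 sequential steps."""
        vals = list(values)
        rounds = 0
        while len(vals) > 1:
            # Pairwise combine; an odd element passes through to the next round.
            nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
            if len(vals) % 2:
                nxt.append(vals[-1])
            vals = nxt
            rounds += 1
        return vals[0], rounds

    total, rounds = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
    print(total, rounds)  # 36 in 3 rounds for p = 8
    ```

    Computationally intensive operators map well onto this pattern; the data-movement operators the abstract calls inefficient are exactly those whose cost is dominated by the communication rounds rather than by `op`.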

  8. Concurrency-based approaches to parallel programming

    Science.gov (United States)

    Kale, L.V.; Chrisochoides, N.; Kohl, J.; Yelick, K.

    1995-01-01

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches is their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  9. Advances in gallium arsenide monolithic microwave integrated-circuit technology for space communications systems

    Science.gov (United States)

    Bhasin, K. B.; Connolly, D. J.

    1986-01-01

    Future communications satellites are likely to use gallium arsenide (GaAs) monolithic microwave integrated-circuit (MMIC) technology in most, if not all, communications payload subsystems. Multiple-scanning-beam antenna systems are expected to use GaAs MMIC's to increase functional capability, to reduce volume, weight, and cost, and to greatly improve system reliability. RF and IF matrix switch technology based on GaAs MMIC's is also being developed for these reasons. MMIC technology, including gigabit-rate GaAs digital integrated circuits, offers substantial advantages in power consumption and weight over silicon technologies for high-throughput, on-board baseband processor systems. In this paper, current developments in GaAs MMIC technology are described, and the status and prospects of the technology are assessed.

  10. Real time processor for array speckle interferometry

    Science.gov (United States)

    Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

    1989-01-01

    The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.
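    The acquire, flat-field, 2-D FFT, power-spectrum-average pipeline can be sketched in plain Python; the 8x8 toy below stands in for the 64 x 64 hardware path (function names and sizes are illustrative, not the processor's design):

    ```python
    import cmath
    import random

    def fft(x):
        # Radix-2 Cooley-Tukey FFT (length must be a power of two).
        n = len(x)
        if n == 1:
            return x[:]
        even, odd = fft(x[0::2]), fft(x[1::2])
        out = [0j] * n
        for k in range(n // 2):
            t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
            out[k] = even[k] + t
            out[k + n // 2] = even[k] - t
        return out

    def fft2(frame):
        # Row-column decomposition of the 2-D FFT.
        rows = [fft(row) for row in frame]
        cols = [fft([rows[r][c] for r in range(len(rows))]) for c in range(len(rows[0]))]
        return [[cols[c][r] for c in range(len(cols))] for r in range(len(rows))]

    def average_power_spectrum(frames, flat):
        # Flat-field each frame, FFT it, and accumulate |F|^2 over all frames.
        n = len(frames[0])
        acc = [[0.0] * n for _ in range(n)]
        for frame in frames:
            corrected = [[frame[r][c] / flat[r][c] for c in range(n)] for r in range(n)]
            spec = fft2(corrected)
            for r in range(n):
                for c in range(n):
                    acc[r][c] += abs(spec[r][c]) ** 2
        m = len(frames)
        return [[v / m for v in row] for row in acc]

    flat = [[1.0] * 8 for _ in range(8)]
    frames = [[[random.random() + 1.0 for _ in range(8)] for _ in range(8)] for _ in range(3)]
    avg = average_power_spectrum(frames, flat)
    print(len(avg), len(avg[0]))  # 8 8
    ```

    In the real processor each stage runs in dedicated hardware so the whole chain fits inside the 25 ms speckle coherence time; the sketch only shows the data flow.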

  11. Making CSB+-Tree Processor Conscious

    DEFF Research Database (Denmark)

    Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

    2005-01-01

    Cache-conscious indexes, such as the CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt the CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of the CSB+-tree on Itanium 2. Finally, we propose a systematic method for adapting the CSB+-tree to new platforms. This work is a first step towards integrating the CSB+-tree in MySQL's heap storage manager.

  12. Intrusion Detection Architecture Utilizing Graphics Processors

    Directory of Open Access Journals (Sweden)

    Branislav Madoš

    2012-12-01

    Full Text Available With thriving technology and the great increase in the usage of computer networks, the risk of these networks coming under attack has increased. A number of techniques have been created and designed to help in detecting and/or preventing such attacks. One common technique is the use of Intrusion Detection Systems (IDS). Today, a number of open-source and commercial IDS are available to match enterprise requirements. However, the performance of these systems is still the main concern. This paper examines perceptions of intrusion detection architecture implementation resulting from the use of graphics processors. It discusses recent research activities, developments and problems of operating systems security. Some exploratory evidence is presented that shows the capabilities of using graphics processors with intrusion detection systems. The focus is on how knowledge gained through graphics processor inclusion has played out in the design of an intrusion detection architecture, seen as an opportunity to strengthen research expertise.

  13. ETHERNET PACKET PROCESSOR FOR SOC APPLICATION

    Directory of Open Access Journals (Sweden)

    Raja Jitendra Nayaka

    2012-07-01

    Full Text Available As demand for the Internet expands significantly in numbers of users, servers, IP addresses, switches and routers, the IP-based network architecture must evolve and change. The design of domain-specific processors that require high performance, low power and a high degree of programmability is the bottleneck in many processor-based applications. This paper describes the design of an Ethernet packet processor for system-on-chip (SoC) which performs all core packet-processing functions, including segmentation and reassembly, packetization, classification, and route and queue management, which will speed up switching/routing performance. Our design has been configured for use with multiple projects targeted to a commercial configurable logic device; the system is designed to support 10/100/1000 links with a speed advantage. VHDL has been used to implement and simulate the required functions on an FPGA.

  14. SWIFT Privacy: Data Processor Becomes Data Controller

    Directory of Open Access Journals (Sweden)

    Edwin Jacobs

    2007-04-01

    Full Text Available Last month, SWIFT emphasised the urgent need for a solution to compliance with US Treasury subpoenas that provides legal certainty for the financial industry as well as for SWIFT. SWIFT will continue its activities to adhere to the Safe Harbor framework of the European data privacy legislation. Safe Harbor is a framework negotiated by the EU and US in 2000 to provide a way for companies in Europe, with operations in the US, to conform to EU data privacy regulations. This seems to conclude a complex privacy case, widely covered by the US and European media. A fundamental question in this case was who is a data controller and who is a mere data processor. Both the Belgian and the European privacy authorities considered SWIFT, jointly with the banks, as a data controller whereas SWIFT had considered itself as a mere data processor that processed financial data for banks. The difference between controller and processor has far reaching consequences.

  15. A new parallel-vector finite element analysis software on distributed-memory computers

    Science.gov (United States)

    Qin, Jiangning; Nguyen, Duc T.

    1993-01-01

    A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.

  16. Efficient SIMD optimization for media processors

    Institute of Scientific and Technical Information of China (English)

    Jian-peng ZHOU; Ce SHI

    2008-01-01

    Single instruction multiple data (SIMD) instructions are often implemented in modern media processors. Although SIMD instructions are useful in multimedia applications, most compilers do not have good support for them. This paper focuses on SIMD instruction generation for media processors. We present an efficient code optimization approach that is integrated into a retargetable C compiler. SIMD instructions are generated by finding and combining the same operations in programs. Experimental results for the UltraSPARC VIS instruction set show that a speedup factor of up to 2.639 is obtained.
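    The effect of packing several identical operations into one word-wide instruction can be mimicked in software ("SWAR"); a toy sketch of four 8-bit additions carried out in a single 32-bit operation (illustrative of the idea only, not the compiler's algorithm):

    ```python
    MASK_LOW7 = 0x7f7f7f7f
    MASK_HIGH = 0x80808080

    def packed_add8(a, b):
        # Add four 8-bit lanes held in 32-bit words without carries crossing
        # lanes: sum the low 7 bits of each lane, then patch the top bit via XOR.
        low = (a & MASK_LOW7) + (b & MASK_LOW7)
        return low ^ ((a ^ b) & MASK_HIGH)

    def pack(bytes4):
        # Pack four 8-bit values into one 32-bit word, lane 0 in the low byte.
        v = 0
        for i, x in enumerate(bytes4):
            v |= (x & 0xff) << (8 * i)
        return v

    def unpack(v):
        return [(v >> (8 * i)) & 0xff for i in range(4)]

    a, b = pack([250, 3, 17, 200]), pack([10, 5, 30, 100])
    print(unpack(packed_add8(a, b)))  # lane-wise sums mod 256: [4, 8, 47, 44]
    ```

    A vectorizing compiler does the analogous transformation at the instruction level: it proves four adjacent byte additions are independent, then emits one packed-add instruction instead of four scalar ones.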

  17. 8 Bit RISC Processor Using Verilog HDL

    Directory of Open Access Journals (Sweden)

    Ramandeep Kaur

    2014-03-01

    Full Text Available RISC is a design philosophy that reduces the complexity of the instruction set, which in turn reduces the amount of space, cycle time, cost and other parameters taken into account during implementation of the design. The advent of the FPGA has enabled complex logical systems to be implemented on FPGAs. The intent of this paper is to design and implement an 8-bit RISC processor using the FPGA Spartan 3E tool. The processor design depends upon design specification, analysis and simulation. It takes into consideration a very simple instruction set. The main components include the control unit, ALU, shift registers and the accumulator register.
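    The fetch-decode-execute cycle of a simple accumulator-based RISC can be sketched in software; the opcodes and encodings below are invented for illustration, since the paper does not specify its instruction set:

    ```python
    def run(program, memory):
        """Tiny accumulator machine: program is a list of (op, arg) pairs,
        memory a list of 8-bit cells. Returns the final accumulator value."""
        acc, pc = 0, 0
        while pc < len(program):
            op, arg = program[pc]  # fetch
            pc += 1
            if op == "LDI":                       # load immediate into accumulator
                acc = arg & 0xff
            elif op == "LD":                      # load from memory
                acc = memory[arg]
            elif op == "ST":                      # store accumulator to memory
                memory[arg] = acc
            elif op == "ADD":                     # add memory cell, wrap to 8 bits
                acc = (acc + memory[arg]) & 0xff
            elif op == "JNZ":                     # branch to arg if acc is nonzero
                pc = arg if acc != 0 else pc
            elif op == "HLT":                     # halt
                break
        return acc

    mem = [5, 7, 0]
    prog = [("LD", 0), ("ADD", 1), ("ST", 2), ("HLT", 0)]
    print(run(prog, mem), mem[2])  # 12 12
    ```

    An FPGA implementation realizes the same loop as hardware: the control unit sequences fetch/decode, and the ALU and registers carry out the per-opcode actions in parallel datapath logic.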

  18. Models of Communication for Multicore Processors

    DEFF Research Database (Denmark)

    Schoeberl, Martin; Sørensen, Rasmus Bo; Sparsø, Jens

    2015-01-01

    To efficiently use multicore processors we need to ensure that almost all data communication stays on chip, i.e., the bits moved between tasks executing on different processor cores do not leave the chip. Different forms of on-chip communication are supported by different hardware mechanisms, e.g., shared caches with cache coherency protocols, core-to-core networks-on-chip, and shared scratchpad memories. In this paper we explore the different hardware mechanisms for on-chip communication and how they support or favor different models of communication. Furthermore, we discuss the usability of the different models of communication for real-time systems.

  20. Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors

    Directory of Open Access Journals (Sweden)

    Abdul Kareem PARCHUR

    2012-08-01

    Full Text Available High performance is a critical requirement for all microprocessor manufacturers. The present paper compares the performance of two main Intel Xeon series (Type A: Intel Xeon X5260, X5460, E5450 and L5320; Type B: Intel Xeon X5140, 5130, 5120 and E5310). The microarchitecture of these processors is implemented on the basis of a new family of processors from Intel starting with the Pentium 4 processor. These processors can provide a performance boost for many key application areas of the modern generation. The scaling of performance across the two series has been analyzed using the performance numbers of 12 SPEC CPU2006 integer benchmarks, which exhibit significant differences in performance. The results and analysis can be used by performance engineers, scientists and developers to better understand performance scaling in modern-generation processors.
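    SPEC integer scores are conventionally summarized as the geometric mean of the per-benchmark speed ratios. A minimal sketch of that aggregation (the ratio values below are invented for illustration, not taken from the paper):

    ```python
    import math

    def geometric_mean(ratios):
        # SPEC-style aggregate: geometric mean of per-benchmark speed ratios,
        # computed in log space for numerical stability.
        return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

    ratios = [12.1, 9.8, 15.3, 11.0, 13.4]  # hypothetical per-benchmark ratios
    print(round(geometric_mean(ratios), 2))
    ```

    Unlike the arithmetic mean, the geometric mean is insensitive to which machine is chosen as the reference, which is why benchmark suites use it for cross-processor comparisons like the one in this paper.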

  1. Preemptive scheduling of parallel jobs on multiprocessors

    Energy Technology Data Exchange (ETDEWEB)

    Deng, Xiaotie; Gu, Nian; Brecht, T. [York Univ., North York (Canada); KaiCheng Lu [TsingHua Univ., Beijing (China)

    1996-12-31

    We study the problem of processor scheduling for parallel jobs. We prove that, for jobs with a single phase of parallelism, an algorithm can achieve a mean response time within 2 - 2/(n+1) times the optimum. This is extended to jobs with multiple phases of parallelism and to interactive jobs (with phases during which the job has no CPU requirements) for a solution within 4 - 4/(n+1) times the optimum. Compared with previous work, our assumption that job execution times are unknown prior to completion is more realistic, our multi-phase job model is more general, and our approximation ratio (for jobs with a single phase of parallelism) is better and cannot be improved.
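    The quoted approximation guarantees can be tabulated directly; a quick sketch of the 2 - 2/(n+1) and 4 - 4/(n+1) bounds (n as in the abstract, presumably the number of processors):

    ```python
    def single_phase_bound(n):
        # Mean-response-time guarantee for jobs with one phase of parallelism.
        return 2 - 2 / (n + 1)

    def multi_phase_bound(n):
        # Guarantee for multi-phase and interactive jobs.
        return 4 - 4 / (n + 1)

    for n in (1, 2, 4, 16, 1024):
        print(n, single_phase_bound(n), multi_phase_bound(n))
    # The bounds approach 2 and 4 respectively as n grows.
    ```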

  2. The PISCES 2 parallel programming environment

    Science.gov (United States)

    Pratt, Terrence W.

    1987-01-01

    PISCES 2 is a programming environment for scientific and engineering computations on MIMD parallel computers. It is currently implemented on a flexible FLEX/32 at NASA Langley, a 20 processor machine with both shared and local memories. The environment provides an extended Fortran for applications programming, a configuration environment for setting up a run on the parallel machine, and a run-time environment for monitoring and controlling program execution. This paper describes the overall design of the system and its implementation on the FLEX/32. Emphasis is placed on several novel aspects of the design: the use of a carefully defined virtual machine, programmer control of the mapping of virtual machine to actual hardware, forces for medium-granularity parallelism, and windows for parallel distribution of data. Some preliminary measurements of storage use are included.

  3. Inter Processor Communication for Fault Diagnosis in Multiprocessor Systems

    Directory of Open Access Journals (Sweden)

    C. D. Malleswar

    1994-04-01

    Full Text Available In the present paper a simple technique is proposed for fault diagnosis in multiprocessor and multiple-system environments, wherein all microprocessors in the system are used in part to check the health of their neighbouring processors. It involves building simple fail-safe serial communication links between processors. Processors communicate with each other over these links, and each processor is made to go through certain sequences of actions intended for diagnosis, under the observation of another processor. With limited overhead, fault detection can be done by this method. Also outlined are some of the popular techniques used for health checks of processor-based systems.

  4. Compilation Techniques Specific for a Hardware Cryptography-Embedded Multimedia Mobile Processor

    Directory of Open Access Journals (Sweden)

    Masa-aki FUKASE

    2007-12-01

    Full Text Available The development of single chip VLSI processors is the key technology of ever growing pervasive computing to answer overall demands for usability, mobility, speed, security, etc. We have so far developed a hardware cryptography-embedded multimedia mobile processor architecture, HCgorilla. Since HCgorilla integrates a wide range of techniques from architectures to applications and languages, one-sided design approach is not always useful. HCgorilla needs more complicated strategy, that is, hardware/software (H/S codesign. Thus, we exploit the software support of HCgorilla composed of a Java interface and parallelizing compilers. They are assumed to be installed in servers in order to reduce the load and increase the performance of HCgorilla-embedded clients. Since compilers are the essence of software's responsibility, we focus in this article on our recent results about the design, specifications, and prototyping of parallelizing compilers for HCgorilla. The parallelizing compilers are composed of a multicore compiler and a LIW compiler. They are specified to abstract parallelism from executable serial codes or the Java interface output and output the codes executable in parallel by HCgorilla. The prototyping compilers are written in Java. The evaluation by using an arithmetic test program shows the reasonability of the prototyping compilers compared with hand compilers.

  6. Scalable Parallel Implementation of GEANT4 Using Commodity Hardware and Task Oriented Parallel C

    CERN Document Server

    Cooperman, G; Grinberg, V; McCauley, T; Reucroft, S; Swain, J D; Cooperman, Gene; Anchordoqui, Luis; Grinberg, Victor; Cauley, Thomas Mc; Reucroft, Stephen; Swain, John

    2000-01-01

    We describe a scalable parallelization of Geant4 using commodity hardware in a collaborative effort between the College of Computer Science and the Department of Physics at Northeastern University. The system consists of a Beowulf cluster of 32 Pentium II processors with 128 MBytes of memory each, connected via ATM and fast Ethernet. The bulk of the parallelization is done using TOP-C (Task Oriented Parallel C), software widely used in the computational algebra community. TOP-C provides a flexible and powerful framework for parallel algorithm development, is easy to learn, and is available at no cost. Its task oriented nature allows one to parallelize legacy code while hiding the details of interprocess communications. Applications include fast interactive simulation of computationally intensive processes such as electromagnetic showers. General results motivate wider applications of TOP-C to other simulation problems as well as to pattern recognition in high energy physics.

  7. Poker on the Cosmic Cube: The First Retargetable Parallel Programming Language and Environment.

    Science.gov (United States)

    1986-06-01

    parallel programming environment, to new parallel architectures. The specifics are illustrated by describing the retargeting of Poker to Caltech's Cosmic Cube. Poker requires only three features from the target architecture: MIMD operation, message-passing inter-process communication, and a sequential language (e.g. C) for the processor elements. In return, Poker gives the new architecture a complete parallel programming environment that will compile Poker parallel programs, without modification, into efficient object code for the new architecture.

  8. Monolithic silicon photonics in a sub-100nm SOI CMOS microprocessor foundry: progress from devices to systems

    Science.gov (United States)

    Popović, Miloš A.; Wade, Mark T.; Orcutt, Jason S.; Shainline, Jeffrey M.; Sun, Chen; Georgas, Michael; Moss, Benjamin; Kumar, Rajesh; Alloatti, Luca; Pavanello, Fabio; Chen, Yu-Hsin; Nammari, Kareem; Notaros, Jelena; Atabaki, Amir; Leu, Jonathan; Stojanović, Vladimir; Ram, Rajeev J.

    2015-02-01

    We review recent progress of an effort led by the Stojanović (UC Berkeley), Ram (MIT) and Popović (CU Boulder) research groups to enable the design of photonic devices, and complete on-chip electro-optic systems and interfaces, directly in standard microelectronics CMOS processes in a microprocessor foundry, with no in-foundry process modifications. This approach allows tight and large-scale monolithic integration of silicon photonics with state-of-the-art (sub-100nm-node) microelectronics, here a 45nm SOI CMOS process. It enables natural scale-up to manufacturing, and rapid advances in device design due to process repeatability. The initial driver application was addressing the processor-to-memory communication energy bottleneck. Device results include 5Gbps modulators based on an interleaved junction that take advantage of the high resolution of the sub-100nm CMOS process. We demonstrate operation at 5fJ/bit with 1.5dB insertion loss and 8dB extinction ratio. We also demonstrate the first infrared detectors in a zero-change CMOS process, using absorption in transistor source/drain SiGe stressors. Subsystems described include the first monolithically integrated electronic-photonic transmitter on chip (modulator+driver) with 20-70fJ/bit wall plug energy/bit (2-3.5Gbps), to our knowledge the lowest transmitter energy demonstrated to date. We also demonstrate native-process infrared receivers at 220fJ/bit (5Gbps). These are encouraging signs for the prospects of monolithic electronics-photonics integration. Beyond processor-to-memory interconnects, our approach to photonics as a "More-than-Moore" technology inside advanced CMOS promises to enable VLSI electronic-photonic chip platforms tailored to a vast array of emerging applications, from optical and acoustic sensing, high-speed signal processing, RF and optical metrology and clocks, through to analog computation and quantum technology.

  9. Parallel SAT Solving using Bit-level Operations

    NARCIS (Netherlands)

    Heule, M.J.H.; Van Maaren, H.

    2008-01-01

    We show how to exploit the 32/64 bit architecture of modern computers to accelerate some of the algorithms used in satisfiability solving by modifying assignments to variables in parallel on a single processor. Techniques such as random sampling demonstrate that while using bit vectors instead of Bo
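    The bit-level idea described above can be sketched as follows. This is an illustrative Python snippet of my own (not the authors' implementation): each variable holds a machine word whose bit i is that variable's value in candidate assignment i, so a single pass over the clauses evaluates the formula under 64 assignments at once.

```python
# Bit-parallel CNF evaluation sketch: one 64-bit word per variable,
# bit i = the variable's value in candidate assignment i.
MASK = (1 << 64) - 1

def eval_cnf_bitparallel(cnf, assignment):
    """cnf: list of clauses, each a list of signed ints (DIMACS style).
    assignment: dict var -> 64-bit word of candidate values.
    Returns a word whose bit i is 1 iff assignment i satisfies the CNF."""
    result = MASK
    for clause in cnf:
        clause_val = 0
        for lit in clause:
            word = assignment[abs(lit)]
            # A positive literal contributes the word itself, a negative
            # literal its bitwise complement (masked to 64 bits).
            clause_val |= word if lit > 0 else (~word & MASK)
        result &= clause_val  # the formula is an AND of its clauses
    return result

# (x1 or x2) and (not x1 or x2) is satisfied exactly when x2 is true.
cnf = [[1, 2], [-1, 2]]
assignment = {1: 0b0101, 2: 0b0011}  # four candidate assignments in the low bits
print(bin(eval_cnf_bitparallel(cnf, assignment)))  # bits where x2 = 1
```

    The function name and clause encoding are mine; the point is only that 64 Boolean evaluations happen per word operation.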

  10. Space Station Water Processor Process Pump

    Science.gov (United States)

    Parker, David

    1995-01-01

    This report presents the results of the development program conducted under contract NAS8-38250-12 related to the International Space Station (ISS) Water Processor (WP) Process Pump. The results of the Process Pump evaluation conducted under this program indicate that further development is required in order to achieve the performance and life requirements of the ISS WP.

  11. Fuel processors for fuel cell APU applications

    Science.gov (United States)

    Aicher, T.; Lenz, B.; Gschnell, F.; Groos, U.; Federici, F.; Caprile, L.; Parodi, L.

    The conversion of liquid hydrocarbons to a hydrogen rich product gas is a central process step in fuel processors for auxiliary power units (APUs) for vehicles of all kinds. The selection of the reforming process depends on the fuel and the type of the fuel cell. For vehicle power trains, liquid hydrocarbons like gasoline, kerosene, and diesel are utilized and, therefore, they will also be the fuel for the respective APU systems. The fuel cells commonly envisioned for mobile APU applications are molten carbonate fuel cells (MCFC), solid oxide fuel cells (SOFC), and proton exchange membrane fuel cells (PEMFC). Since high-temperature fuel cells, e.g. MCFCs or SOFCs, can be supplied with a feed gas that contains carbon monoxide (CO), their fuel processor does not require reactors for CO reduction and removal. For PEMFCs, on the other hand, CO concentrations in the feed gas must not exceed 50 ppm, preferably 20 ppm, which requires additional reactors downstream of the reforming reactor. This paper gives an overview of the current state of fuel processor development for APU applications and of APU system developments. Furthermore, it presents the latest developments at Fraunhofer ISE regarding fuel processors for high-temperature fuel cell APU systems on board ships and aircraft.

  12. A Demo Processor as an Educational Tool

    NARCIS (Netherlands)

    van Moergestel, L.; van Nieuwland, K.; Vermond, L.; Meyer, John-Jules Charles

    2014-01-01

    Explaining the workings of a processor can be done in several ways. Just a written explanation, some pictures, a simulator program or a real hardware demo. The research presented here is based on the idea that a slowly working hardware demo could be a nice tool to explain to IT students the inner

  13. Continuous history variable for programmable quantum processors

    CERN Document Server

    Vlasov, Alexander Yu

    2010-01-01

    This brief note discusses the application of a continuous quantum history ("trash") variable to simplify the scheme of a programmable quantum processor. A similar scheme may also be tested in other models of the theory of quantum algorithms and complexity, because it provides a modification of a standard operation: quantum function evaluation.

  14. Focal-plane sensor-processor chips

    CERN Document Server

    Zarándy, Ákos

    2011-01-01

    Focal-Plane Sensor-Processor Chips explores both the implementation and application of state-of-the-art vision chips. Presenting an overview of focal plane chip technology, the text discusses smart imagers and cellular wave computers, along with numerous examples of current vision chips.

  15. Microarchitecture of the Godson-2 Processor

    Institute of Scientific and Technical Information of China (English)

    Wei-Wu Hu; Fu-Xin Zhang; Zu-Song Li

    2005-01-01

    The Godson project is the first attempt to design high performance general-purpose microprocessors in China. This paper introduces the microarchitecture of the Godson-2 processor, which is a 64-bit, 4-issue, out-of-order execution RISC processor that implements a 64-bit MIPS-like instruction set. The adoption of aggressive out-of-order execution techniques (such as register mapping, branch prediction, and dynamic scheduling) and cache techniques (such as non-blocking cache, load speculation, and dynamic memory disambiguation) helps the Godson-2 processor achieve high performance even at a modest clock frequency. The Godson-2 processor has been physically implemented in a 6-metal 0.18μm CMOS technology based on an automatic placing and routing flow with the help of some crafted library cells and macros. The area of the chip is 6,700 micrometers by 6,200 micrometers and the clock cycle at the typical corner is 2.3ns.

  16. Noise limitations in optical linear algebra processors.

    Science.gov (United States)

    Batsell, S G; Jong, T L; Walkup, J F; Krile, T F

    1990-05-10

    A general statistical noise model is presented for optical linear algebra processors. A statistical analysis which includes device noise, the multiplication process, and the addition operation is undertaken. We focus on those processes which are architecturally independent. Finally, experimental results which verify the analytical predictions are also presented.

  17. Practical guide to energy management for processors

    CERN Document Server

    Consortium, Energywise

    2012-01-01

    Do you know how best to manage and reduce your energy consumption? This book gives comprehensive guidance on effective energy management for organisations in the polymer processing industry. This book is one of three which support the ENERGYWISE Plastics Project eLearning platform for European plastics processors to increase their knowledge and understanding of energy management. Topics covered include: Understanding Energy,

  18. Globe hosts launch of new processor

    CERN Multimedia

    2006-01-01

    Launch of the quadcore processor chip at the Globe. On 14 November, in a series of major media events around the world, the chip-maker Intel launched its new 'quadcore' processor. For the regions of Europe, the Middle East and Africa, the day-long launch event took place in CERN's Globe of Science and Innovation, with over 30 journalists in attendance, coming from as far away as Johannesburg and Dubai. CERN was a significant choice for the event: the first tests of this new generation of processor in Europe had been made at CERN over the preceding months, as part of CERN openlab, a research partnership with leading IT companies such as Intel, HP and Oracle. The event also provided the opportunity for the journalists to visit ATLAS and the CERN Computer Centre. The strategy of putting multiple processor cores on the same chip, which has been pursued by Intel and other chip-makers in the last few years, represents an important departure from the more traditional improvements in the sheer speed of such chips. ...

  19. Analysis of Reconfigurable Processors Using Petri Net

    Directory of Open Access Journals (Sweden)

    Hadis Heidari

    2013-07-01

    Full Text Available In this paper, we propose Petri net models for processing elements. The processing elements include a general-purpose processor (GPP), a reconfigurable element (RE), and a hybrid element (combining a GPP with an RE). The models consist of many transitions and places, and the model and associated analysis methods provide a promising tool for modeling and performance evaluation of reconfigurable processors. The model is demonstrated by considering a simple example. This paper describes the development of a reconfigurable processor; the developed system is based on the Petri net concept. Petri nets are becoming suitable as a formal model for hardware system design: designers can use a Petri net as a modeling language to perform high-level analysis of complex processor designs. The simulation is done with the PIPEv4.1 simulator. The simulation results show that the Petri net state spaces are bounded and safe and contain no deadlock, and that the average number of tokens in the first place is 0.9901. In these models there are only 5% errors; the analysis time in these models is 0.016 seconds.
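    The mechanics of the Petri net formalism used above can be illustrated with a minimal sketch (my own illustrative Python, not the paper's PIPE models): a transition is enabled when every input place holds a token, firing moves tokens from inputs to outputs, and a deadlock is a marking in which no transition is enabled.

```python
# Minimal Petri-net sketch: a transition is a pair (input places,
# output places); a marking maps each place to its token count.
def enabled(marking, transition):
    inputs, _ = transition
    return all(marking[p] > 0 for p in inputs)

def fire(marking, transition):
    inputs, outputs = transition
    m = dict(marking)
    for p in inputs:
        m[p] -= 1        # consume one token from each input place
    for p in outputs:
        m[p] += 1        # produce one token in each output place
    return m

# Two-place cycle: the single token circulates forever, so the net is
# bounded (never more than one token) and deadlock-free.
transitions = [(("p1",), ("p2",)), (("p2",), ("p1",))]
marking = {"p1": 1, "p2": 0}
for _ in range(4):
    t = next(t for t in transitions if enabled(transitions and marking, t)) if False else next(t for t in transitions if enabled(marking, t))
    marking = fire(marking, t)
print(marking)  # back to the initial marking after an even number of firings
```

    Real tools such as PIPE compute the full reachable state space to check boundedness and deadlock; the loop here just fires one enabled transition at a time.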

  20. Multi-petascale highly efficient parallel supercomputer

    Energy Technology Data Exchange (ETDEWEB)

    Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O' Brien, John K.; O' Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

    2015-07-14

    A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, each having full access to all system resources, enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application-by-application basis, and preferably enabling adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication. Nodes are interconnected by a five-dimensional torus network with DMA that maximizes the throughput of packet communications between nodes and minimizes latency.

  1. Parallel continuous flow: a parallel suffix tree construction tool for whole genomes.

    Science.gov (United States)

    Comin, Matteo; Farreras, Montse

    2014-04-01

    The construction of suffix trees for very long sequences is essential for many applications, and it plays a central role in the bioinformatic domain. With the advent of modern sequencing technologies, biological sequence databases have grown dramatically, and the methodologies required to analyze these data have become more complex every day, requiring fast queries to multiple genomes. In this article, we present parallel continuous flow (PCF), a parallel suffix tree construction method that is suitable for very long genomes. We tested our method on the suffix tree construction of the entire human genome, about 3GB. We showed that PCF can scale gracefully as the size of the input genome grows. Our method can work with an efficiency of 90% with 36 processors and 55% with 172 processors. We can index the human genome in 7 minutes using 172 processes.
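    To make the kind of index involved concrete, here is a naive serial sketch of the closely related suffix array (my own illustrative Python; PCF itself builds suffix trees in parallel with far better asymptotic behavior than this O(n² log n) toy): every suffix start position is sorted lexicographically, after which any pattern occurrence can be located among the suffixes that start with it.

```python
# Naive suffix-array construction: sort suffix start positions by the
# suffix text. Fine for toy strings, hopeless for a 3GB genome -- which
# is exactly why parallel construction methods like PCF matter.
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def find(text, sa, pattern):
    """Return the start positions of all occurrences of pattern
    (linear scan over the array, kept simple for illustration)."""
    m = len(pattern)
    return [i for i in sa if text[i:i + m] == pattern]

text = "banana"
sa = suffix_array(text)
print(sa)                             # [5, 3, 1, 0, 4, 2]
print(sorted(find(text, sa, "ana")))  # [1, 3]
```

    A production index would use binary search over the sorted suffixes instead of the linear scan shown here.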

  2. Parallel R

    CERN Document Server

    McCallum, Ethan

    2011-01-01

    It's tough to argue with R as a high-quality, cross-platform, open source statistical software product, unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.

  3. Cyclic Redundancy Checking (CRC) Accelerator for Embedded Processor Datapaths

    National Research Council Canada - National Science Library

    Abdul Rehman Buzdar; Liguo Sun; Rao Kashif; Muhammad Waqar Azhar; Muhammad Imran Khan

    2017-01-01

    We present the integration of a multimode Cyclic Redundancy Checking (CRC) accelerator unit with an embedded processor datapath to enhance the processor performance in terms of execution time and energy efficiency...

  4. Area and Energy Efficient Viterbi Accelerator for Embedded Processor Datapaths

    National Research Council Canada - National Science Library

    Abdul Rehman Buzdar; Liguo Sun; Muhammad Waqar Azhar; Muhammad Imran Khan; Rao Kashif

    2017-01-01

    .... We present the integration of a mixed hardware/software viterbi accelerator unit with an embedded processor datapath to enhance the processor performance in terms of execution time and energy efficiency...

  5. The Parallelism of Traditional Transaction Model%传统事务模型的并行性

    Institute of Scientific and Technical Information of China (English)

    张志强; 李建中; 周立柱

    2001-01-01

    Transactions are a central concept in DBMSs, with the well-known properties of consistency, atomicity, durability and isolation. In this paper, we first analyze the parallelism of the traditional transaction model. We then point out that more parallelism can be exploited through highly parallel processing on multiprocessor architectures, and compare the influence of two different software architectures on database system parallelism.

  6. Parallelization of a Monte Carlo particle transport simulation code

    Science.gov (United States)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures, including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors, and a 200 dual-processor HP cluster. For large problem sizes, limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with more accurate physical models, and improve statistics, as more particle tracks can be simulated in a low response time.
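    The decomposition pattern described above can be sketched as follows (my own illustrative Python, not the MC4 code): each worker owns an independently seeded random stream, standing in for the SPRNG/DCMT parallel generators, and the partial tallies are reduced at the end. Threads stand in for MPI ranks here, so this shows the structure of the parallelization rather than a true speedup.

```python
# Parallel Monte Carlo sketch: estimate pi by sampling points in the
# unit square. Each worker gets its own seeded generator so the streams
# are independent and the run is reproducible.
import random
from concurrent.futures import ThreadPoolExecutor

def worker(seed, n):
    rng = random.Random(seed)          # independent stream per worker
    return sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))

def parallel_pi(n_per_worker=50_000, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as ex:
        hits = ex.map(worker, range(workers), [n_per_worker] * workers)
    # Reduction step: combine the per-worker hit counts.
    return 4.0 * sum(hits) / (n_per_worker * workers)

print(parallel_pi())  # a Monte Carlo estimate of pi
```

    In the real code the reduction would be an MPI collective (e.g. a sum over ranks) rather than a local `sum`.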

  7. Detector defect correction of medical images on graphics processors

    Science.gov (United States)

    Membarth, Richard; Hannig, Frank; Teich, Jürgen; Litz, Gerhard; Hornegger, Heinz

    2011-03-01

    The ever increasing complexity and power dissipation of computer architectures in the last decade blazed the trail for more power efficient parallel architectures. Hence, architectures like field-programmable gate arrays (FPGAs) and in particular graphics cards attained great interest and are consequently adopted for parallel execution of many number crunching loop programs from fields like image processing or linear algebra. However, there is little effort to deploy applications that are barely computational but memory intensive to graphics hardware. This paper considers a memory intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors like DSPs were used for such critical processing. We show that this is feasible today with commodity graphics hardware. Using CUDA as the programming model, it is demonstrated that the detector defect correction pipeline consisting of more than ten algorithms is significantly accelerated and that a speedup of 20x can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation. For deployment in a streaming application with steadily new incoming data, it is shown that the memory transfer overhead of successive images to the graphics card memory is reduced by 83% using double buffering.
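    The double-buffering idea can be sketched in plain Python threads (my own illustration; the paper overlaps host-to-device copies with kernels via CUDA streams): while the pipeline processes image k from one buffer, a background thread already copies image k+1 into the other buffer, hiding the transfer time behind the computation.

```python
# Double-buffering sketch: two buffers, alternating roles each step.
# "transfer" stands in for the host-to-GPU copy, "compute" for the
# correction pipeline; the prefetch thread overlaps the two.
import threading

def process_stream(images, transfer, compute):
    results = []
    buffers = [None, None]
    buffers[0] = transfer(images[0])            # prime the first buffer
    for k in range(len(images)):
        t = None
        if k + 1 < len(images):
            def prefetch(i=k + 1, slot=(k + 1) % 2):
                buffers[slot] = transfer(images[i])
            t = threading.Thread(target=prefetch)
            t.start()                           # overlap next copy with compute
        results.append(compute(buffers[k % 2]))
        if t:
            t.join()                            # next buffer ready before reuse
    return results

# Toy stand-ins: "transfer" copies, "compute" corrects a constant offset.
imgs = [[1, 2], [3, 4], [5, 6]]
out = process_stream(imgs, transfer=list, compute=lambda b: [x - 1 for x in b])
print(out)  # [[0, 1], [2, 3], [4, 5]]
```

    The prefetch always writes the buffer slot the current compute step is not reading, so no synchronization beyond the final `join` is needed.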

  8. Concurrent and Accurate Short Read Mapping on Multicore Processors.

    Science.gov (United States)

    Martínez, Héctor; Tárraga, Joaquín; Medina, Ignacio; Barrachina, Sergio; Castillo, Maribel; Dopazo, Joaquín; Quintana-Ortí, Enrique S

    2015-01-01

    We introduce a parallel aligner with a work-flow organization for fast and accurate mapping of RNA sequences on servers equipped with multicore processors. Our software, HPG Aligner SA (an open-source application, available at http://www.opencb.org), exploits a suffix array to rapidly map a large fraction of the RNA fragments (reads), as well as leverages the accuracy of the Smith-Waterman algorithm to deal with conflictive reads. The aligner is enhanced with a careful strategy to detect splice junctions based on an adaptive division of RNA reads into small segments (or seeds), which are then mapped onto a number of candidate alignment locations, providing crucial information for the successful alignment of the complete reads. The experimental results on a platform with Intel multicore technology report the parallel performance of HPG Aligner SA, on RNA reads of 100-400 nucleotides, which excels in execution time/sensitivity compared with state-of-the-art aligners such as TopHat 2+Bowtie 2, MapSplice, and STAR.
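    For readers unfamiliar with the Smith-Waterman algorithm mentioned above, here is a minimal scoring-only version (my own illustrative Python with textbook scoring parameters, not the aligner's optimized kernel): a dynamic-programming matrix is filled in, clamped at zero so that the best local alignment can start and end anywhere.

```python
# Minimal Smith-Waterman local alignment score (linear gap penalty).
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamp at 0: a local alignment may start fresh anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # 12 for this classic pair
```

    A full aligner would also keep traceback pointers to recover the alignment itself; only the score matters for this sketch.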

  9. Macroporous Monolithic Polymers: Preparation and Applications

    Directory of Open Access Journals (Sweden)

    Cecilia Inés Alvarez Igarzabal

    2009-12-01

    Full Text Available In recent years, macroporous monolithic materials have been introduced as a new and useful generation of polymers used in different fields. These polymers may be prepared in a simple way from a homogeneous mixture in a mold and contain large interconnected pores or channels allowing for high flow rates at moderate pressures. Due to their porous characteristics, they can be used in different processes, such as stationary phases for different types of chromatography, high-throughput bioreactors and microfluidic chip applications. This review reports the contributions of several groups working in the preparation of different macroporous monoliths and their modification by immobilization of specific ligands on the products for specific purposes.

  10. Monolithic pixel detectors for high energy physics

    CERN Document Server

    Snoeys, W

    2013-01-01

    Monolithic pixel detectors integrating sensor matrix and readout in one piece of silicon have revolutionized imaging for consumer applications, but despite years of research they have not yet been widely adopted for high energy physics. Two major requirements for this application, radiation tolerance and low power consumption, require charge collection by drift for the most extreme radiation levels and an optimization of the ratio of collected signal charge to input capacitance (Q/C). It is shown that monolithic detectors can achieve a Q/C sufficient for low analog power consumption and even carry out the promise to practically eliminate analog power consumption, but combining sufficient Q/C, collection by drift, and integration of readout circuitry within the pixel remains a challenge. An overview is given of different approaches to address this challenge, with possible advantages and disadvantages.

  11. Array processors based on Gaussian fraction-free method

    Energy Technology Data Exchange (ETDEWEB)

    Peng, S.; Sedukhin, S. [Aizu Univ., Aizuwakamatsu, Fukushima (Japan); Sedukhin, I.

    1998-03-01

    The design of algorithmic array processors for solving linear systems of equations using fraction-free Gaussian elimination method is presented. The design is based on a formal approach which constructs a family of planar array processors systematically. These array processors are synthesized and analyzed. It is shown that some array processors are optimal in the framework of linear allocation of computations and in terms of number of processing elements and computing time. (author)

  12. Speedup bioinformatics applications on multicore-based processor using vectorizing and multithreading strategies.

    Science.gov (United States)

    Chaichoompu, Kridsadakorn; Kittitornkun, Surin; Tongsima, Sissades

    2007-12-30

    Much computationally intensive bioinformatics software, such as multiple sequence alignment and population structure analysis tools written in C/C++, is not multicore-aware. A multicore processor is an emerging CPU technology that combines two or more independent processors into a single package. The Single Instruction Multiple Data-stream (SIMD) paradigm is heavily utilized in this class of processors. Nevertheless, most popular compilers, including Microsoft Visual C/C++ 6.0 and the x86 gnu C-compiler gcc, do not automatically create SIMD code that can fully utilize the advancement of these processors. To harness the power of the new multicore architecture, certain compiler techniques must be considered. This paper presents a generic compiling strategy to assist the compiler in improving the performance of bioinformatics applications written in C/C++. The proposed framework contains 2 main steps: multithreading and vectorizing strategies. After following the strategies, the application can achieve higher speedup by taking advantage of multicore architecture technology. Due to the extremely fast interconnection networking among multiple cores, it is suggested that the proposed optimization could be more appropriate than making use of parallelization on a small cluster computer, which has larger network latency and lower bandwidth.
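    The multithreading half of the strategy can be sketched as follows (my own illustrative Python; the paper targets C/C++ with SIMD intrinsics, which have no direct Python analogue): a batch of independent sequence comparisons is split into chunks, one per worker, mirroring how multicore-aware bioinformatics code divides work across cores.

```python
# Multithreaded pairwise-distance sketch: strided chunking spreads the
# independent comparisons over a pool of workers, then the per-worker
# results are re-interleaved into input order.
from concurrent.futures import ThreadPoolExecutor

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def all_distances(pairs, workers=4):
    chunks = [pairs[i::workers] for i in range(workers)]   # strided split
    with ThreadPoolExecutor(max_workers=workers) as ex:
        parts = ex.map(lambda c: [hamming(a, b) for a, b in c], chunks)
    results = [None] * len(pairs)
    for w, part in enumerate(parts):
        for k, d in enumerate(part):
            results[w + k * workers] = d   # undo the strided split
    return results

pairs = [("ACGT", "ACGA"), ("AAAA", "TTTT"), ("ACGT", "ACGT"), ("AC", "CA")]
print(all_distances(pairs))  # [1, 4, 0, 2]
```

    In C/C++ the inner `hamming` loop is where the vectorizing strategy applies, comparing many characters per SIMD instruction.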

  13. Update On Monolithic Fuel Fabrication Development

    Energy Technology Data Exchange (ETDEWEB)

    C. R Clark; J. M. Wight; G. C. Knighton; G. A. Moore; J. F. Jue

    2005-11-01

    Efforts to develop a viable monolithic research reactor fuel plate have continued at Idaho National Laboratory. These efforts have concentrated on both fabrication process refinement and scale-up to produce full sized fuel plates. Advancements have been made in the production of U-Mo foil including full sized foils. Progress has also been made in the friction stir welding and transient liquid phase bonding fabrication processes resulting in better bonding, more stable processes and the ability to fabricate larger fuel plates.

  14. FLUIDIZED BED STEAM REFORMER MONOLITH FORMATION

    Energy Technology Data Exchange (ETDEWEB)

    Jantzen, C

    2006-12-22

    Fluidized Bed Steam Reforming (FBSR) is being considered as an alternative technology for the immobilization of a wide variety of aqueous high sodium containing radioactive wastes at various DOE facilities in the United States. The addition of clay, charcoal, and a catalyst as co-reactants converts aqueous Low Activity Wastes (LAW) to a granular or "mineralized" waste form while converting organic components to CO₂ and steam, and nitrate/nitrite components, if any, to N₂. The waste form produced is a multiphase mineral assemblage of Na-Al-Si (NAS) feldspathoid minerals with cage-like structures that atomically bond radionuclides like Tc-99 and anions such as SO₄, I, F, and Cl. The granular product has been shown to be as durable as LAW glass. Shallow land burial requires that the mineralized waste form be able to sustain the weight of soil overburden and potential intrusion by future generations. The strength requirement necessitates binding the granular product into a monolith. FBSR mineral products were formulated into a variety of monoliths including various cements, Ceramicrete, and hydroceramics. All but one of the nine monoliths tested met the <2 g/m² durability specification for Na and Re (simulant for Tc-99) when tested using the Product Consistency Test (PCT; ASTM C1285). Of the nine monoliths tested, the cements produced with 80-87 wt% FBSR product, the Ceramicrete, and the hydroceramic produced with 83.3 wt% FBSR product met the compressive strength and durability requirements for an LAW waste form.

  15. Monolithically integrated interferometer for optical displacement measurement

    Science.gov (United States)

    Hofstetter, Daniel; Zappe, Hans P.

    1996-01-01

    We discuss the fabrication of a monolithically integrated optical displacement sensor using III-V semiconductor technology. The device is configured as a Michelson interferometer and consists of a distributed Bragg reflector laser, a photodetector and waveguides forming a directional coupler. Using this interferometer, displacements in the 100 nm range could be measured at distances of up to 45 cm. We present the fabrication, device results and characterization of the completed interferometer; problems, limitations and future applications are also discussed.

  16. Multicentre evaluation of the Naída CI Q70 sound processor: feedback from cochlear implant users and professionals

    OpenAIRE

    Jeanette Martin; Christine Poncet-Wallet; Angelika Illg; Sarah Perrin-Webb; Lise Henderson; Nathalie Noël-Petroff; Gennaro Auletta; Maria Grazia Barezzani; Karim Houri; Indian Research Group; Heike Bagus; Ulrich Hoppe; Jane Humphries; Wiebke van Treeck; Briaire, Jeroen J

    2016-01-01

    The aim of this survey was to gather data from both implant recipients and professionals on the ease of use of the Naída CI Q70 (Naída CI) sound processor from Advanced Bionics and on the usefulness of the new functions and features available. A secondary objective was to investigate fitting practices with the new processor. A comprehensive user satisfaction survey was conducted in a total of 186 subjects from 24 centres. In parallel, 23 professional questionnaires were collected from 11 cent...

  17. Optimizing FORTRAN Programs for Hierarchical Memory Parallel Processing Systems

    Institute of Scientific and Technical Information of China (English)

    金国华; 陈福接

    1993-01-01

    Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases with one linear function in the perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program parallelism.
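    The basic loop-interchange transformation underlying techniques like HLI can be sketched as follows (my own illustrative Python; HLI itself is a more general hybrid transformation): swapping the loop order leaves the result unchanged but turns strided memory accesses into contiguous ones, which is what removes cache and local-memory thrashing on real hardware.

```python
# Loop interchange sketch: both traversals compute the same sum, but
# the interchanged (row-major) version visits each row contiguously,
# matching the row-major layout of the nested lists.
def sum_col_major(m):
    rows, cols = len(m), len(m[0])
    total = 0
    for j in range(cols):          # outer loop over columns: strided access
        for i in range(rows):
            total += m[i][j]
    return total

def sum_row_major(m):
    total = 0
    for row in m:                  # interchanged loops: contiguous access
        for x in row:
            total += x
    return total

m = [[1, 2, 3], [4, 5, 6]]
print(sum_col_major(m), sum_row_major(m))  # 21 21
```

    Interchange is legal here because the loop body carries no dependences between iterations; checking such dependences is exactly what compilers must do before applying the transformation.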

  18. Kalman Filter Tracking on Parallel Architectures

    Directory of Open Access Journals (Sweden)

    Cerati Giuseppe

    2016-01-01

    Full Text Available Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
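    The Kalman filter at the heart of the approach can be shown in its simplest scalar form (my own illustrative Python; the tracking code runs a multidimensional filter over detector hits, vectorized across track candidates): each measurement updates the state estimate by a gain-weighted innovation, trading off measurement noise against model uncertainty.

```python
# Minimal scalar Kalman filter: estimate a constant true value from a
# stream of noisy readings (r = measurement noise, q = process noise).
def kalman_1d(measurements, r=1.0, q=1e-4):
    x, p = measurements[0], 1.0    # initial state estimate and covariance
    for z in measurements[1:]:
        p += q                     # predict: constant model, grow uncertainty
        k = p / (p + r)            # Kalman gain
        x += k * (z - x)           # update with the innovation z - x
        p *= (1 - k)               # shrink covariance after the update
    return x

zs = [5.2, 4.8, 5.1, 4.9, 5.0, 5.05, 4.95]
print(round(kalman_1d(zs), 2))  # close to the true value 5.0
```

    In track fitting the state is a vector (position, direction, curvature) and the gain involves matrix algebra, but the predict/update structure is the same, which is what makes it amenable to vectorization across many tracks.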

  19. Kalman Filter Tracking on Parallel Architectures

    Science.gov (United States)

    Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

    2016-11-01

    Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.

  20. An overview of monolithic zirconia in dentistry

    Directory of Open Access Journals (Sweden)

    Özlem Malkondu

    2016-07-01

    Full Text Available Zirconia restorations have been used successfully for years in dentistry owing to their biocompatibility and good mechanical properties. Because of their lack of translucency, zirconia cores are generally veneered with porcelain, which makes restorations weaker due to failure of the adhesion between the two materials. In recent years, all-ceramic zirconia restorations have been introduced in the dental sector with the intent to solve this problem. Besides the elimination of chipping, the reduced occlusal space requirement seems to be a clear advantage of monolithic zirconia restorations. However, scientific evidence is needed to recommend this relatively new application for clinical use. This mini-review discusses the current scientific literature on monolithic zirconia restorations. The results of in vitro studies suggest that monolithic zirconia may be the best choice for posterior fixed partial dentures in the presence of high occlusal loads and minimal occlusal restoration space. These results should be supported with many more in vitro and particularly in vivo studies before a final conclusion can be drawn.

  1. 49 CFR 234.275 - Processor-based systems.

    Science.gov (United States)

    2010-10-01

    Title 49 (Transportation), Volume 4, revised as of 2010-10-01. Department of Transportation — Grade Crossing Signal System Safety and State Action Plans; Maintenance, Inspection, and Testing Requirements for Processor-Based Systems; § 234.275 Processor-based systems.

  2. A lock circuit for a multi-core processor

    DEFF Research Database (Denmark)

    2015-01-01

    An integrated circuit comprising multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores that...
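    A small software model of the claimed queue-register behaviour: one request bit per core, with the grant derived from the register contents. The lowest-index priority used for the grant is an assumption for illustration, since the abstract does not specify the arbitration policy.

```python
# Software model of a queue-register lock: each core has a dedicated request
# bit; the grant goes to the lowest-numbered core whose bit is set (an
# assumed arbitration policy, purely for illustration).

class QueueLock:
    def __init__(self, n_cores):
        self.bits = [False] * n_cores   # one request bit per core

    def request(self, core):
        self.bits[core] = True          # set via the core's dedicated line

    def release(self, core):
        self.bits[core] = False         # reset when the core is done

    def holder(self):
        """Core currently granted the lock, or None if the lock is idle."""
        for core, wants in enumerate(self.bits):
            if wants:
                return core
        return None

lock = QueueLock(4)
lock.request(2)
lock.request(0)
first = lock.holder()    # core 0 wins under lowest-index priority
lock.release(0)
second = lock.holder()   # the grant passes to core 2
```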

  3. Vacuum-Induced Surface Freezing to Produce Monoliths of Aligned Porous Alumina

    Directory of Open Access Journals (Sweden)

    Sandra Großberger

    2016-12-01

    Full Text Available Vacuum-induced surface freezing has been used to produce uni-directional freezing of colloidal aluminum oxide dispersions. It leads to zones of different structure within the resulting sintered monoliths that are highly similar to those known from freeze casting using a cryogen cold source. A more-or-less dense surface layer and a cellular sub-surface region are formed, beneath which is a middle region of aligned lamellae and pores that stretches through most of the depth of the monolith. This is the case even at a volume fraction of dispersed phase as low as 0.032. A denser but still porous base layer is formed by accumulation of rejected nanoparticles ahead of the freezing front; it differs from previous reports in that no ice lenses are observed. X-ray micro-computed tomography reveals a uniform aligned pore structure vertically through the monolith. The pores close to the periphery are oriented radially or as chords, while the center region contains domains of parallel pores/lamellae. The domains are randomly oriented with respect to one another, as already reported for regular freeze casting. This technique for directional freezing is convenient and easy to perform, but requires further refinement in that the temperature gradient and freezing rates have yet to be measured. Also, control of the temperature gradient by varying chamber vacuum and shelf temperature needs to be evaluated.

  4. An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

    CERN Document Server

    Makino, J

    2001-01-01

    We present a novel, highly efficient algorithm to parallelize the O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as the number of processors increases, since the communication cost is independent of the number of processors. In the new algorithm, p processors are organized as a $\sqrt{p}\times \sqrt{p}$ two-dimensional array. Each processor has $N/\sqrt{p}$ particles, but the data are distributed in such a way that the complete system is present in any row or column of $\sqrt{p}$ processors. In this algorithm, the communication cost scales as $N/\sqrt{p}$, while the calculation cost scales as $N^2/p$. Thus, we can use a much larger number of processors without losing efficiency compared to what was practical with previously known...
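    The scaling argument can be checked with a toy cost model. Only the N and p dependences come from the abstract; the per-item costs `t_comm` and `t_calc` are illustrative assumptions.

```python
from math import sqrt

# Toy cost model for the two parallelizations: in the copy algorithm every
# processor ends up with all N particles, so communication ~ N regardless of
# p; in the 2D algorithm data move along rows/columns of a sqrt(p) x sqrt(p)
# grid, so communication ~ N/sqrt(p). Computation is N^2/p in both cases.
# t_comm and t_calc are assumed per-item costs, not measured values.

def cost_copy(n, p, t_comm=1.0, t_calc=1.0):
    return t_comm * n + t_calc * n * n / p

def cost_2d(n, p, t_comm=1.0, t_calc=1.0):
    return t_comm * n / sqrt(p) + t_calc * n * n / p

N = 10_000
copy_total = cost_copy(N, 4096)
grid_total = cost_2d(N, 4096)   # lower: communication shrank by sqrt(p)
```

As p grows, the fixed ~N communication term dominates the copy algorithm while the 2D algorithm's communication keeps shrinking, which is the efficiency claim above.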

  5. Preparation of imprinted monolithic column under molecular crowding conditions

    Institute of Scientific and Technical Information of China (English)

    Xiao Xia Li; Xin Liu; Li Hong Bai; Hong Quan Duan; Yan Ping Huang; Zhao Sheng Liu

    2011-01-01

    Molecular crowding is a new concept for obtaining molecularly imprinted polymers (MIPs) with greater capacity and selectivity. In this work, a molecular crowding agent was applied for the first time to the preparation of a MIP monolithic column. A new polymerization system based on a molecular crowding environment, composed of polystyrene and tetrahydrofuran, was developed to prepare an enrofloxacin-imprinted monolith. The results showed that the monolithic MIP prepared under molecular crowding conditions exhibited good molecular recognition for enrofloxacin, with an imprinting factor of 3.03.

  6. Parallel Lines

    Directory of Open Access Journals (Sweden)

    James G. Worner

    2017-05-01

    Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship.   ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.

  7. Finite difference programs and array processors. [High-speed floating point processing by coupling host computer to programmable array processor]

    Energy Technology Data Exchange (ETDEWEB)

    Rudy, T.E.

    1977-08-01

    An alternative to maxi computers for high-speed floating-point processing is the coupling of a host computer to a programmable array processor. This paper compares the performance of two finite difference programs on various computers and their expected performance on commercially available array processors. The significance of balancing array processor computation, host-array processor control traffic, and data transfer operations is emphasized. 3 figures, 1 table.

  8. Cache Energy Optimization Techniques For Modern Processors

    Energy Technology Data Exchange (ETDEWEB)

    Mittal, Sparsh [ORNL

    2013-01-01

    Modern multicore processors are employing large last-level caches; for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence leakage is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for production systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems, covering desktop, QoS, real-time and server systems. We also present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-transfer torque RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy-efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead micro-architectural components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to saving energy, to ensure that cache reconfiguration does not increase the energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving the energy efficiency of higher-end embedded, desktop, QoS, real-time and server processors and multitasking systems. This book is intended to be a valuable guide for both

  9. Parallelization of Kinetic Theory Simulations

    CERN Document Server

    Howell, Jim; Colbry, Dirk; Pickett, Rodney; Staber, Alec; Sagert, Irina; Strother, Terrance

    2013-01-01

    Numerical studies of shock waves in large scale systems via kinetic simulations with millions of particles are too computationally demanding to be processed in serial. In this work we focus on optimizing the parallel performance of a kinetic Monte Carlo code for astrophysical simulations such as core-collapse supernovae. Our goal is to attain a flexible program that scales well with the architecture of modern supercomputers. This approach requires a hybrid model of programming that combines a message passing interface (MPI) with a multithreading model (OpenMP) in C++. We report on our approach to implement the hybrid design into the kinetic code and show first results which demonstrate a significant gain in performance when many processors are applied.

  10. Data-parallel tomographic reconstruction : A comparison of filtered backprojection and direct Fourier reconstruction

    NARCIS (Netherlands)

    Roerdink, Jos B.T.M.; Westenberg, Michel A.

    1998-01-01

    We consider the parallelization of two standard 2D reconstruction algorithms, filtered backprojection and direct Fourier reconstruction, using the data-parallel programming style. The algorithms are implemented on a Connection Machine CM-5 with 16 processors and a peak performance of 2 Gflop/s.
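    The per-angle structure that makes backprojection data-parallel can be sketched in a few lines. This toy uses only the 0° and 90° views of a point object and omits the ramp-filtering step of full filtered backprojection.

```python
# Backprojection smears each 1D projection back across the 2D image grid;
# the loop over projection angles is the natural data-parallel axis (each
# processor handles a subset of angles). Only the 0 and 90 degree views of
# a point object are used here, and ramp filtering is omitted.

def backproject(proj0, proj90):
    n = len(proj0)
    # angle 0 smears along rows (constant per column),
    # angle 90 smears along columns (constant per row)
    return [[proj0[j] + proj90[i] for j in range(n)] for i in range(n)]

proj0  = [0, 0, 1, 0]   # column sums of a point object at row 1, column 2
proj90 = [0, 1, 0, 0]   # row sums of the same object
img = backproject(proj0, proj90)
# the backprojected intensity is largest at the true point location (1, 2)
```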

  11. Resource efficiency of hardware extensions of a 4-issue VLIW processor for elliptic curve cryptography

    Science.gov (United States)

    Jungeblut, T.; Puttmann, C.; Dreesen, R.; Porrmann, M.; Thies, M.; Rückert, U.; Kastens, U.

    2010-12-01

    The secure transmission of data plays a significant role in today's information era. In particular, public-key cryptography methods based on elliptic curves (ECC) are gaining more and more importance. Compared to other asymmetric algorithms, such as RSA, ECC can be used with shorter key lengths while achieving an equal level of security. The performance of ECC algorithms can be increased significantly by adding application-specific hardware extensions. Due to their fine-grained parallelism, VLIW processors are well suited for the execution of ECC algorithms. In this work, we extended the fourfold parallel CoreVA VLIW architecture with several hardware accelerators to increase the resource efficiency of the overall system. For the design-space exploration we use a dual design flow, based on the automatic generation of a complete C-compiler-based tool chain from a central processor specification. Using the hardware accelerators, the performance of scalar multiplication on binary fields can be increased by a factor of 29, and the energy consumption can be reduced by up to 90%. The extended processor hardware was mapped onto a current 65 nm low-power standard-cell technology. The chip area of the CoreVA VLIW architecture is 0.24 mm2 at a power consumption of 29 mW/MHz. The performance gain is analyzed with respect to the increased hardware costs, such as chip area and power consumption.
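    The operation the accelerators target, scalar multiplication, can be sketched with the standard double-and-add loop over a toy prime field. The paper uses binary fields, but the loop structure is the same; the curve and base point below are the small textbook example y² = x³ + 2x + 2 over GF(17), not a realistic key size.

```python
# Double-and-add scalar multiplication on the toy curve y^2 = x^3 + 2x + 2
# over GF(17). Hardware ECC accelerators speed up exactly this inner loop of
# field multiplications and inversions; binary fields, as in the paper,
# use the same structure with different field arithmetic.

P_MOD, A = 17, 2          # field prime and curve coefficient a

def ec_add(p, q):
    """Add two points (None is the point at infinity)."""
    if p is None: return q
    if q is None: return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                     # P + (-P) = infinity
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def scalar_mult(k, p):
    """Compute k*P by scanning the bits of k (double-and-add)."""
    result = None                   # accumulator starts at infinity
    while k:
        if k & 1:
            result = ec_add(result, p)
        p = ec_add(p, p)            # point doubling each iteration
        k >>= 1
    return result
```

The base point (5, 1) on this curve has order 19, so multiplying by the order returns the point at infinity.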

  12. A Hybrid Scheme Based on Pipelining and Multitasking in Mobile Application Processors for Advanced Video Coding

    Directory of Open Access Journals (Sweden)

    Muhammad Asif

    2015-01-01

    Full Text Available One of the key requirements for mobile devices is to provide high-performance computing at lower power consumption. The processors used in these devices provide specific hardware resources to handle computationally intensive video processing and interactive graphical applications. Moreover, processors designed for low-power applications may impose limitations on the availability and usage of resources, which presents additional challenges to system designers. Owing to the specific design of the JZ47x series of mobile application processors, a hybrid software-hardware implementation scheme for the H.264/AVC encoder is proposed in this work. The proposed scheme distributes the encoding tasks among hardware and software modules. A series of optimization techniques is developed to speed up memory access and data transfer among memories. Moreover, an efficient data reuse design is proposed for the deblocking filter video processing unit to reduce memory accesses. Furthermore, fine-grained macroblock (MB) level parallelism is effectively exploited and a pipelined approach is proposed for efficient utilization of hardware processing cores. Finally, based on the parallelism in the proposed design, encoding tasks are distributed between two processing cores. Experiments show that the hybrid encoder is 12 times faster than a highly optimized sequential encoder, owing to the proposed techniques.
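    MB-level pipelining of this kind can be sketched in software with two stages connected by a queue: while stage 2 works on macroblock i, stage 1 is already processing macroblock i+1. The stage functions below are hypothetical stand-ins, not the JZ47x hardware units from the paper.

```python
import queue
import threading

# Two-stage macroblock pipeline: stage 1 (e.g. prediction/transform) feeds
# stage 2 (e.g. entropy coding) through a bounded queue, so the stages run
# concurrently on different macroblocks. The arithmetic stands in for real
# encoding work.

def run_pipeline(macroblocks):
    q = queue.Queue(maxsize=4)      # bounded buffer between the stages
    encoded = []

    def stage1():
        for mb in macroblocks:
            q.put(mb * 2)           # stand-in for stage-1 processing
        q.put(None)                 # sentinel: no more macroblocks

    def stage2():
        while True:
            item = q.get()
            if item is None:
                break
            encoded.append(item + 1)  # stand-in for stage-2 processing

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return encoded

result = run_pipeline([1, 2, 3, 4])
```

The queue preserves macroblock order, so the output is deterministic even though the stages overlap in time.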

  13. Dynamic Allocation of CPUs in Multicore Processor for Performance Improvement in Network Security Applications

    Directory of Open Access Journals (Sweden)

    Sudhakar Gummadi

    2011-01-01

    Full Text Available Problem statement: Multicore and multithreaded CPUs have become the new approach to increasing the performance of processor-based systems. Numerous applications benefit from the use of multiple cores. Increasing the performance of the system by increasing the number of CPUs of the multicore processor for a given application warrants detailed experimentation. In this study, the results of experimentation done by dynamic allocation/deallocation of CPUs based on the workload conditions of packet processing for a security application are analyzed and presented. Approach: The evaluation was conducted on a Sun Fire T1000 server with a Sun UltraSPARC T1 multicore processor. The OpenMP tasking feature is used for scheduling the logical CPUs for the parallelized application. Dynamic allocation of a CPU to a process is done depending on the workload characterization. Results: Execution time for packet processing was analyzed to arrive at an effective dynamic allocation methodology that is dependent on the hardware and the workload. Conclusion/Recommendations: Based on the analysis, a methodology for allocating the number of CPUs to the parallelized application is suggested.

  14. Fast Optimal Replica Placement with Exhaustive Search Using Dynamically Reconfigurable Processor

    Directory of Open Access Journals (Sweden)

    Hidetoshi Takeshita

    2011-01-01

    Full Text Available This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN). It is derived from the set cover problem, which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by a factor of 18.8 compared to the search limit of the conventional algorithm running on a von Neumann processor.
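    Replica placement as set cover can be stated in a few lines: enumerate candidate server subsets by increasing size until one covers every client region. This is the same exhaustive search the processor parallelizes, shown here on a tiny invented instance (server names and coverage sets are illustrative, not from the paper).

```python
from itertools import combinations

# Exhaustive replica placement as set cover: find the smallest set of
# candidate servers whose combined coverage reaches every client region.
# Enumerating subsets in order of size guarantees the first cover found
# is optimal. Toy data; real CDN instances are far larger, which is why
# the paper parallelizes this enumeration.

def optimal_placement(universe, coverage):
    servers = list(coverage)
    for size in range(1, len(servers) + 1):
        for subset in combinations(servers, size):
            covered = set().union(*(coverage[s] for s in subset))
            if covered >= universe:
                return subset       # first hit at this size is optimal
    return None                     # no subset covers all clients

clients = {1, 2, 3, 4, 5}
coverage = {"s1": {1, 2}, "s2": {2, 3, 4}, "s3": {4, 5}, "s4": {1, 3, 5}}
best = optimal_placement(clients, coverage)
```

The number of subsets grows exponentially in the number of servers, which is exactly the search-limit wall that the pipelined DAPDNA-2 implementation pushes back by a constant factor.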

  15. Monolithic Lumped Element Integrated Circuit (M2LEIC) Transistors.

    Science.gov (United States)

    Keywords: integrated circuits; monolithic structures (electronics); transistors; chips (electronics); fabrication; epitaxial growth; ultrahigh frequency; polysilicons; photolithography; radio-frequency power; impedance matching.

  16. Integer-encoded massively parallel processing of fast-learning fuzzy ARTMAP neural networks

    Science.gov (United States)

    Bahr, Hubert A.; DeMara, Ronald F.; Georgiopoulos, Michael

    1997-04-01

    In this paper we develop techniques that are suitable for the parallel implementation of Fuzzy ARTMAP networks. Speedup and learning performance results are provided for execution on a DECmpp/Sx-1208 parallel processor consisting of a DEC RISC Workstation Front-End and MasPar MP-1 Back-End with 8,192 processors. Experiments of the parallel implementation were conducted on the Letters benchmark database developed by Frey and Slate. The results indicate a speedup on the order of 1000-fold which allows combined training and testing time of under four minutes.

  17. Iterative methods for the WLS state estimation on RISC, vector, and parallel computers

    Energy Technology Data Exchange (ETDEWEB)

    Nieplocha, J. [Pacific Northwest Lab., Richland, WA (United States); Carroll, C.C. [Alabama Univ., University, AL (United States)

    1993-10-01

    We investigate the suitability and effectiveness of iterative methods for solving the weighted-least-squares (WLS) state estimation problem on RISC, vector, and parallel processors. Several of the most popular iterative methods are tested and evaluated. The best-performing method, preconditioned conjugate gradient (PCG), is very well suited to vector and parallel processing, as demonstrated for WLS state estimation of the IEEE standard test systems. A new sparse matrix format for the gain matrix improves the vector performance of the PCG algorithm and makes it competitive with the direct solver. The internal parallelism of the RISC processors used in current multiprocessor systems can also be exploited in an implementation of this algorithm.
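    The PCG iteration itself can be sketched compactly. This uses the simplest (Jacobi, diagonal) preconditioner on a small dense SPD system purely for illustration; the gain matrix in WLS state estimation is sparse, and the paper's preconditioner choice may differ.

```python
# Jacobi-preconditioned conjugate gradient for an SPD system A x = b.
# Pure Python on a small dense example; only the PCG iteration structure
# is the point here, not the sparse storage the paper discusses.

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                   # residual r = b - A x (x = 0)
    z = [r[i] / A[i][i] for i in range(n)]     # apply M^-1, M = diag(A)
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x = pcg(A, b)
```

The matrix-vector product `Ap` dominates each iteration and is the part that vectorizes and parallelizes well, which is why the sparse-matrix format matters so much for performance.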

  18. Novel parallel architectures and algorithms for linear algebra processing. Semiannual report

    Energy Technology Data Exchange (ETDEWEB)

    Casasent, D.

    1986-10-01

    Advanced problems in computational fluid dynamics, finite element structural analysis, and related areas require the solution of large partial differential equations and matrices of large size and dynamic range. This project considers an advanced parallel linear algebra processor and associated novel parallel algorithms for such applications. Research on system fabrication, quantitative performance evaluation, and new parallel algorithms is described and considered. Case studies in structural mechanics, dynamics and nonlinear systems, finite element methods, computational fluid dynamics, and partial differential equations are included. The novel use of an optical processor for such problems is the primary focus of the research.

  19. Parallel embedded systems: where real-time and low-power meet

    DEFF Research Database (Denmark)

    Karakehayov, Zdravko; Guo, Yu

    2008-01-01

    This paper introduces a combination of models and proofs for optimal power management via Dynamic Frequency Scaling and Dynamic Voltage Scaling. The approach is suitable for systems on a chip or microcontrollers where processors run in parallel with embedded peripherals. We have developed a software tool, called CASTLE, to provide computer assistance in the design process of energy-aware embedded systems. The tool considers single-processor and parallel architectures. An example shows an energy reduction of 23% when the tool allocates two microcontrollers for parallel execution.
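    The leverage behind voltage scaling can be shown with the first-order CMOS dynamic-energy model: energy per task is E = C·V²·f·t, and since t scales as 1/f for a fixed cycle count, E ≈ C·V²·cycles, so a lower frequency that permits a lower voltage saves energy quadratically. The capacitance and voltage values below are illustrative assumptions, not figures from the paper.

```python
# First-order dynamic-energy model behind DVS: for a fixed cycle count the
# frequency cancels out and energy per task is E = C_eff * V^2 * cycles.
# Running slower matters because it allows a lower supply voltage, and the
# savings in V are quadratic. All numbers are assumed for illustration.

def task_energy(c_eff, volts, cycles):
    """Dynamic switching energy in joules for a task of `cycles` cycles."""
    return c_eff * volts ** 2 * cycles

fast = task_energy(c_eff=1e-9, volts=1.2, cycles=1_000_000)  # high f, high V
slow = task_energy(c_eff=1e-9, volts=0.9, cycles=1_000_000)  # scaled down
saving = 1 - slow / fast    # fraction of dynamic energy saved
```

Here dropping the supply from 1.2 V to 0.9 V saves (1 − 0.9²/1.2²) ≈ 44% of the dynamic energy for the same work, before accounting for leakage or peripheral constraints.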

  20. SEQUENTIAL AND PARALLEL ALGORITHM TO FIND MAXIMUM FLOW ON EXTENDED MIXED NETWORKS BY REVISED POSTFLOW-PULL METHODS

    Directory of Open Access Journals (Sweden)

    Lau Nguyen Dinh

    2016-01-01

    Full Text Available The problem of finding maximum flow in a network graph is extremely interesting and practically applicable in many fields of daily life, especially transportation; it has therefore been studied by many researchers using various methods. In 2013, we developed a new algorithm, the postflow-pull algorithm, to find the maximum flow on traditional networks. In this paper, we revise postflow-push methods to solve the problem of finding maximum flow on extended mixed networks. In addition, to take advantage of the multi-core architecture of parallel computing systems, we build a parallel version of this algorithm, a method that has not previously been announced. The results of this paper are systematically stated and proven. The idea of the algorithm is to use multiple processors working in parallel with the postflow-push algorithm. Among these processors, one main processor manages the data, sending it to the sub-processors and receiving their results. The sub-processors execute their work simultaneously and send their data to the main processor; when the job is finished, the main processor reports the results of the problem.
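    The revised postflow algorithm itself is not reproduced in the abstract, so as a serial baseline the sketch below solves the same problem with the standard Edmonds-Karp method (BFS augmenting paths); it only illustrates what "maximum flow" computes, not the authors' parallel scheme.

```python
from collections import deque

# Standard Edmonds-Karp maximum flow (BFS augmenting paths) as a baseline
# illustration of the problem; the paper's postflow push/pull variant for
# extended mixed networks is not public, so it is not reproduced here.

def max_flow(capacity, s, t):
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and capacity[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total                    # no augmenting path remains
        # find the bottleneck capacity along the path, then push flow
        bottleneck, v = float("inf"), t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, capacity[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck        # residual edge
            v = u
        total += bottleneck

cap = [[0, 10, 10, 0],
       [0, 0, 2, 8],
       [0, 0, 0, 9],
       [0, 0, 0, 0]]
value = max_flow(cap, 0, 3)   # source node 0, sink node 3
```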

  1. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    Science.gov (United States)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  2. Multiple core computer processor with globally-accessible local memories

    Energy Technology Data Exchange (ETDEWEB)

    Shalf, John; Donofrio, David; Oliker, Leonid

    2016-09-20

    A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality of processor cores.

  3. Design of Processors with Reconfigurable Microarchitecture

    Directory of Open Access Journals (Sweden)

    Andrey Mokhov

    2014-01-01

    Full Text Available Energy becomes a dominating factor for a wide spectrum of computations: from intensive data processing in “big data” companies resulting in large electricity bills, to infrastructure monitoring with wireless sensors relying on energy harvesting. In this context it is essential for a computation system to be adaptable to the power supply and the service demand, which often vary dramatically during runtime. In this paper we present an approach to building processors with reconfigurable microarchitecture capable of changing the way they fetch and execute instructions depending on energy availability and application requirements. We show how to use Conditional Partial Order Graphs to formally specify the microarchitecture of such a processor, explore the design possibilities for its instruction set, and synthesise the instruction decoder using correct-by-construction techniques. The paper is focused on the design methodology, which is evaluated by implementing a power-proportional version of Intel 8051 microprocessor.

  4. Multifunction nonlinear signal processor - Deconvolution and correlation

    Science.gov (United States)

    Javidi, Bahram; Horner, Joseph L.

    1989-08-01

    A multifunctional nonlinear optical signal processor is described that allows different types of operations, such as image deconvolution and nonlinear correlation. In this technique, the joint power spectrum of the input signal is thresholded with varying nonlinearity to produce different specific operations. In image deconvolution, the joint power spectrum is modified and hard-clip thresholded to remove the amplitude distortion effects and to restore the correct phase of the original image. In optical correlation, the Fourier transform interference intensity is thresholded to provide higher correlation peak intensity and a better-defined correlation spot. Various types of correlation signals can be produced simply by varying the severity of the nonlinearity, without the need to synthesize a specific matched filter. An analysis of the nonlinear processor for image deconvolution is presented.

  5. Model of computation for Fourier optical processors

    Science.gov (United States)

    Naughton, Thomas J.

    2000-05-01

    We present a novel and simple theoretical model of computation that captures what we believe are the most important characteristics of an optical Fourier transform processor. We use this abstract model to reason about the computational properties of the physical systems it describes. We define a grammar for our model's instruction language, and use it to write algorithms for well-known filtering and correlation techniques. We also suggest suitable computational complexity measures that could be used to analyze any coherent optical information processing technique, described with the language, for efficiency. Our choice of instruction language allows us to argue that algorithms describable with this model should have optical implementations that do not require a digital electronic computer to act as a master unit. Through simulation of a well known model of computation from computer theory we investigate the general-purpose capabilities of analog optical processors.

  6. Communication Efficient Multi-processor FFT

    Science.gov (United States)

    Lennart Johnsson, S.; Jacquemin, Michel; Krawitz, Robert L.

    1992-10-01

    Computing the fast Fourier transform on a distributed memory architecture by a direct pipelined radix-2, a bi-section, or a multi-section algorithm all yield the same communications requirement, if communication for all FFT stages can be performed concurrently, the input data is in normal order, and the data allocation is consecutive. With a cyclic data allocation, or bit-reversed input data and a consecutive allocation, multi-sectioning offers a reduced communications requirement by approximately a factor of two. For a consecutive data allocation and normal input order, a decimation-in-time FFT requires that P/N + d - 2 twiddle factors be stored for P elements distributed evenly over N processors, with the axis subject to transformation distributed over 2^d processors. No communication of twiddle factors is required. The same storage requirements hold for a decimation-in-frequency FFT, bit-reversed input order, and consecutive data allocation. The opposite combination of FFT type and data ordering requires a factor of log_2 N more storage for N processors. The peak performance for a Connection Machine system CM-200 implementation is 12.9 Gflop/s in 32-bit precision, and 10.7 Gflop/s in 64-bit precision for unordered transforms local to each processor. The corresponding execution rates for ordered transforms are 11.1 Gflop/s and 8.5 Gflop/s, respectively. For distributed one- and two-dimensional transforms the peak performance for unordered transforms exceeds 5 Gflop/s in 32-bit precision and 3 Gflop/s in 64-bit precision. Three-dimensional transforms execute at a slightly lower rate. Distributed ordered transforms execute at about 1/2 to 2/3 the rate of unordered transforms.
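    A serial radix-2 decimation-in-time FFT makes the twiddle factors in the storage analysis concrete: they are the `exp(-2πik/n)` phase factors in the butterfly loop. This pure-Python sketch ignores the distribution of data across processors.

```python
import cmath

# Recursive radix-2 decimation-in-time FFT. The twiddle factors
# cmath.exp(-2j*pi*k/n) multiplying the odd half are the quantities whose
# per-processor storage the analysis above counts; here they are simply
# recomputed, and the data distribution across processors is ignored.

def fft(x):
    n = len(x)                 # n must be a power of two
    if n == 1:
        return x[:]
    even = fft(x[0::2])        # DFT of the even-indexed samples
    odd = fft(x[1::2])         # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

X = fft([1, 2, 3, 4])
```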

  7. Breadboard Signal Processor for Arraying DSN Antennas

    Science.gov (United States)

    Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael

    2008-01-01

    A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a following developed version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.
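    The F and X steps can be shown in miniature: transform two antenna signals, multiply one spectrum by the conjugate of the other, and inverse-transform to recover the circular cross-correlation, whose peak gives the relative delay between the antennas. A naive O(N²) DFT keeps the sketch self-contained; real correlators use FFTs and accumulate over many data blocks.

```python
import cmath

# FX correlation in miniature: Fourier transform two antenna signals (F),
# multiply one spectrum by the conjugate of the other (X), and inverse
# transform to get the circular cross-correlation. The peak lag reveals
# the relative delay used for calibration. Naive O(N^2) DFT for brevity.

def dft(x, sign=-1):
    """Unnormalized DFT (sign=-1 forward, sign=+1 for the inverse kernel)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for m in range(n)) for k in range(n)]

def fx_correlate(a, b):
    A, B = dft(a), dft(b)
    cross = [ai * bi.conjugate() for ai, bi in zip(A, B)]   # the "X" step
    n = len(a)
    return [v / n for v in dft(cross, sign=+1)]             # inverse DFT

a = [0, 1, 0, 0]          # impulse arriving at sample 1 on antenna A
b = [0, 0, 1, 0]          # same impulse one sample later on antenna B
lags = fx_correlate(a, b)
# the correlation magnitude peaks at the circular lag aligning the signals
```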

  8. A post-processor for Gurmukhi OCR

    Indian Academy of Sciences (India)

    G S Lehal; Chandan Singh

    2002-02-01

    A post-processing system for OCR of Gurmukhi script has been developed. Statistical information of Punjabi language syllable combinations, corpora look-up and certain heuristics based on Punjabi grammar rules have been combined to design the post-processor. An improvement of 3% in recognition rate, from 94.35% to 97.34%, has been reported on clean images using the post-processing techniques.

  9. Modules for Pipelined Mixed Radix FFT Processors

    Directory of Open Access Journals (Sweden)

    Anatolij Sergiyenko

    2016-01-01

    Full Text Available A set of soft IP cores for the Winograd r-point fast Fourier transform (FFT) is considered. The cores are designed by the method of spatial SDF mapping into hardware, which minimizes hardware volume at the cost of an r-fold slowdown of the algorithm. Their clock frequency is equal to the data sampling frequency. The cores are intended for high-speed pipelined FFT processors implemented in FPGAs.

  10. High-pressure coal fuel processor development

    Energy Technology Data Exchange (ETDEWEB)

    Greenhalgh, M.L.

    1992-11-01

    The objective of Subtask 1.1 Engine Feasibility was to conduct research needed to establish the technical feasibility of ignition and stable combustion of directly injected, 3,000 psi, low-Btu gas with glow plug ignition assist at diesel engine compression ratios. This objective was accomplished by designing, fabricating, testing and analyzing the combustion performance of synthesized low-Btu coal gas in a single-cylinder test engine combustion rig located at the Caterpillar Technical Center engine lab in Mossville, Illinois. The objective of Subtask 1.2 Fuel Processor Feasibility was to conduct research needed to establish the technical feasibility of air-blown, fixed-bed, high-pressure coal fuel processing at up to 3,000 psi operating pressure, incorporating in-bed sulfur and particulate capture. This objective was accomplished by designing, fabricating, testing and analyzing the performance of bench-scale processors located at Coal Technology Corporation (subcontractor) facilities in Bristol, Virginia. These two subtasks were carried out at widely separated locations and will be discussed in separate sections of this report. They were, however, independent in that the composition of the synthetic coal gas used to fuel the combustion rig was adjusted to reflect the range of exit gas compositions being produced on the fuel processor rig. Two major conclusions resulted from this task. First, direct injected, ignition assisted Diesel cycle engine combustion systems can be suitably modified to efficiently utilize these low-Btu gas fuels. Second, high pressure gasification of selected run-of-the-mine coals in batch-loaded fuel processors is feasible. These two findings, taken together, significantly reduce the perceived technical risks associated with the further development of the proposed coal gas fueled Diesel cycle power plant concept.

  11. Keystone Business Models for Network Security Processors

    Directory of Open Access Journals (Sweden)

    Arthur Low

    2013-07-01

Full Text Available Network security processors are critical components of high-performance systems built for cybersecurity. Development of a network security processor requires multi-domain experience in semiconductors and complex software security applications, and multiple iterations of both software and hardware implementations. Limited by the business models in use today, such an arduous task can be undertaken only by large incumbent companies and government organizations. Neither the “fabless semiconductor” models nor the silicon intellectual-property licensing (“IP-licensing”) models allow small technology companies to successfully compete. This article describes an alternative approach that produces an ongoing stream of novel network security processors for niche markets through continuous innovation by both large and small companies. This approach, referred to here as the “business ecosystem model for network security processors”, includes a flexible and reconfigurable technology platform, a “keystone” business model for the company that maintains the platform architecture, and an extended ecosystem of companies that both contribute and share in the value created by innovation. New opportunities for business model innovation by participating companies are made possible by the ecosystem model. This ecosystem model builds on: (i) the lessons learned from the experience of the first author as a senior integrated circuit architect for providers of public-key cryptography solutions and as the owner of a semiconductor startup, and (ii) the latest scholarly research on technology entrepreneurship, business models, platforms, and business ecosystems. This article will be of interest to all technology entrepreneurs, but it will be of particular interest to owners of small companies that provide security solutions and to specialized security professionals seeking to launch their own companies.

  12. Algorithmic support for commodity-based parallel computing systems.

    Energy Technology Data Exchange (ETDEWEB)

Leung, Vitus Joseph; Bender, Michael A. (State University of New York, Stony Brook, NY); Bunde, David P. (University of Illinois, Urbana, IL); Phillips, Cynthia Ann

    2003-10-01

The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one-dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in ℝ^d, find k points that minimize their average pairwise L₁ distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
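
The space-filling-curve idea can be sketched in a few lines. The snippet below is an illustration only, not Cplant code (the report does not specify which curve Cplant uses; a Z-order/Morton curve stands in here as the simplest example): free processors of a 2-D mesh are ordered along the curve and a job receives the first k of them, so allocations tend to be spatially clustered and pairwise L₁ distances stay small.

```python
def morton_key(x, y, bits=8):
    """Interleave the bits of x and y to get the Z-order (Morton) index."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

def allocate(free_procs, k):
    """Pick k free processors that are contiguous along the Z-order curve."""
    ordered = sorted(free_procs, key=lambda p: morton_key(*p))
    return ordered[:k]
```

A Hilbert-curve key gives slightly better locality than Morton order at the cost of a more involved index computation, which is why Hilbert curves are the usual choice in allocation work.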

  13. SKIRT: Hybrid parallelization of radiative transfer simulations

    Science.gov (United States)

    Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

    2017-07-01

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
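
The memory argument for the hybrid scheme can be made concrete with a back-of-the-envelope calculation (not SKIRT code; the node and core counts below are made up). Under pure MPI every rank holds its own copy of the read-mostly data structures, such as the dust grid, while under the hybrid scheme all threads within a process share one copy.

```python
def data_copies(nodes, cores_per_node, threads_per_process):
    """Copies of the shared read-mostly data structures held in memory:
    one per MPI process, regardless of how many threads each runs."""
    total_cores = nodes * cores_per_node
    processes = total_cores // threads_per_process
    return processes

# Pure MPI (1 thread per process) vs. hybrid (16 threads per process)
# on 8 nodes with 16 cores each: 128 copies shrink to 8.
```

The same total core count is kept busy in both cases; only the duplication factor changes, which is what lets the hybrid scheme fit larger models in memory.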

  14. Electric Propulsion Plume Simulations Using Parallel Computer

    Directory of Open Access Journals (Sweden)

    Joseph Wang

    2007-01-01

Full Text Available A parallel, three-dimensional electrostatic PIC code is developed for large-scale electric propulsion simulations using parallel supercomputers. This code uses a newly developed immersed-finite-element particle-in-cell (IFE-PIC) algorithm designed to handle complex boundary conditions accurately while maintaining the computational speed of the standard PIC code. Domain decomposition is used in both field solve and particle push to divide the computation among processors. Two simulation studies are presented to demonstrate the capability of the code. The first is a full particle simulation of a near-thruster plume using the real ion-to-electron mass ratio. The second is a high-resolution simulation of multiple ion thruster plume interactions for a realistic spacecraft using a domain enclosing the entire solar array panel. Performance benchmarks show that the IFE-PIC achieves a high parallel efficiency of ≥ 90%.
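
Domain decomposition of the particle push can be illustrated with a one-dimensional sketch (a hypothetical helper, not taken from the IFE-PIC code): each rank owns an equal slab of the domain, and a particle is pushed by the rank whose slab contains it.

```python
def owner(position, domain_min, domain_max, n_procs):
    """Map a particle's 1-D position to the rank owning that slab
    of the domain (uniform block decomposition)."""
    width = (domain_max - domain_min) / n_procs
    rank = int((position - domain_min) / width)
    return min(max(rank, 0), n_procs - 1)  # clamp particles on the boundary
```

Particles that drift across a slab boundary during the push are then exchanged with the neighbouring rank at the end of each time step.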

15. High-temperature compressive deformation of Si₃N₄/BN fibrous monoliths.

    Energy Technology Data Exchange (ETDEWEB)

    Routbort, J. L.

    1999-02-04

Fibrous monolithic Si₃N₄/BN (~85 vol.% Si₃N₄/15 vol.% BN) and monolithic Si₃N₄ ceramics were compressed at a nearly constant strain rate (ε) at 1200-1400°C in N₂. The ε range was ~1 × 10⁻⁶ to 5 × 10⁻⁶ s⁻¹; the stress (σ) range was 37-202 MPa. The Si₃N₄ and the unidirectional fibrous monoliths that were oriented with the long axis of the Si₃N₄ cells parallel to the compression direction exhibited plasticity at 1300 and 1400°C, with ε ∝ σ. A 0/90° cross-ply Si₃N₄/BN laminate also exhibited significant plasticity, but it was weaker than the above-mentioned ceramics. The unidirectional fibrous monoliths that were compressed perpendicular to the cell direction fractured at ~50 MPa in all tests. A ±45° laminate tested at 1300°C fractured at a stress of ~40 MPa. Low fracture stress correlated with shear through BN layers.

  16. Intelligent trigger processor for the crystal box

    CERN Document Server

    Sanders, G H; Cooper, M D; Hart, G W; Hoffman, C M; Hogan, G E; Hughes, E B; Matis, H S; Rolfe, J; Sandberg, V D; Williams, R A; Wilson, S; Zeman, H

    1981-01-01

A large-solid-angle modular NaI(Tl) detector with 432 phototubes and 88 trigger scintillators is being used to search simultaneously for three lepton-flavor-changing decays of the muon. A beam of up to 10⁶ muons stopping per second with a 6% duty factor would yield up to 1000 triggers per second from random triple coincidences. A reduction of the trigger rate to 10 Hz is required from a hardwired primary trigger processor. Further reduction to <1 Hz is achieved by a microprocessor-based secondary trigger processor. The primary trigger hardware imposes voter coincidence logic, stringent timing requirements, and a non-adjacency requirement in the trigger scintillators defined by hardwired circuits. Sophisticated geometric requirements are imposed by PROM-based matrix logic, and energy and vector-momentum cuts are imposed by a hardwired processor using LSI flash ADCs and digital arithmetic logic. The secondary trigger employs four satellite microprocessors to do a sparse data scan, multiplex ...
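
The non-adjacency requirement lends itself to a compact software model (an illustration only; the experiment implemented it in hardwired circuits, and the ring numbering here is a hypothetical simplification): treat the 88 trigger scintillators as a ring and reject any pattern in which two fired counters are neighbours.

```python
def non_adjacent(fired, n_counters=88):
    """True if no two fired scintillators are neighbours on the ring
    (indices adjacent modulo n_counters)."""
    hits = set(fired)
    return all((i + 1) % n_counters not in hits for i in hits)
```

Rejecting adjacent pairs suppresses triggers in which a single particle crossing fires two neighbouring counters.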

  17. Software-Reconfigurable Processors for Spacecraft

    Science.gov (United States)

    Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

    2005-01-01

    A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, error correction, coding, and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).

  18. Efficient searching and sorting applications using an associative array processor

    Science.gov (United States)

    Pace, W.; Quinn, M. J.

    1978-01-01

The purpose of this paper is to describe a method of searching and sorting data by using some of the unique capabilities of an associative array processor. To set the context, the associative array processor is first described in detail. In particular, the content addressable memory and flip network are discussed because these two unique elements give the associative array processor the power to rapidly sort and search. A simple alphanumeric sorting example is explained in hardware and software terms. The hardware used to explain the application is the STARAN (Goodyear Aerospace Corporation) associative array processor. The software used is the APPLE (Array Processor Programming Language) programming language. Some applications of the array processor are discussed. The paper closes by contrasting the techniques of a sequential machine with those of the associative array processor.
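
The content-addressable search that gives such machines their speed can be modelled in a few lines (a conceptual sketch, not STARAN hardware or APPLE code): every memory word is compared against the search key at once, under a mask that selects the participating bit positions, and the result is a vector of responder flags.

```python
def cam_search(memory, key, mask):
    """Compare the key against every word simultaneously (conceptually);
    return one responder flag per word. `mask` selects which bits
    take part in the comparison."""
    return [(word & mask) == (key & mask) for word in memory]
```

Sorting then reduces to repeatedly searching for the extremal responder, bit position by bit position, and retiring it, which is why a content-addressable memory sorts and searches so quickly.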

  19. Testing and operating a multiprocessor chip with processor redundancy

    Energy Technology Data Exchange (ETDEWEB)

    Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J

    2014-10-21

    A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
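
The mapping selection can be sketched as follows (a hypothetical simplification for a single redundant core; the actual chip encodes test results in on-chip memory or external storage and selects via a multiplexer): failed primary cores are skipped, and the first failure is remapped onto the spare so the chip still presents a full set of logical processor IDs.

```python
def select_mapping(primary_results, redundant_ok):
    """Build a physical-to-logical processor-ID map. `primary_results`
    holds one pass/fail flag per primary core; the redundant core has
    physical ID len(primary_results)."""
    mapping = {}
    logical = 0
    spare_used = False
    for phys, ok in enumerate(primary_results):
        if ok:
            mapping[phys] = logical
            logical += 1
        elif redundant_ok and not spare_used:
            # route this logical ID to the redundant core instead
            mapping[len(primary_results)] = logical
            logical += 1
            spare_used = True
    return mapping
```

If more cores fail than spares exist, the remaining failures are simply left out of the map, yielding a chip with fewer usable logical processors.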

  20. VLSI based FFT Processor with Improvement in Computation Speed and Area Reduction

    Directory of Open Access Journals (Sweden)

    M.Sheik Mohamed

    2013-06-01

Full Text Available In this paper, a modular approach is presented to develop parallel pipelined architectures for the fast Fourier transform (FFT) processor. The new pipelined FFT architecture exploits otherwise underutilized hardware by deriving final-stage results from complex conjugates, without increasing the hardware complexity. The operating frequency of the new architecture can be decreased, which in turn reduces the power consumption. A comparison of area and computing time is drawn between the new design and the previous architectures. The new structure is synthesized using Xilinx ISE and simulated using ModelSim Starter Edition. The designed FFT algorithm is realized in our processor to reduce the number of complex computations.
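
The conjugate-symmetry property this kind of architecture can exploit is easy to state in code (a numerical illustration, not the VLSI design): for a real-valued input of length N, X[N−k] = conj(X[k]), so only bins 0..N/2 need to be computed and the remaining outputs come essentially for free.

```python
import cmath

def dft_half(x):
    """Directly compute only bins 0..N/2 of the DFT of a real signal."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N // 2 + 1)]

def full_spectrum(half, N):
    """Recover the remaining bins via X[N-k] = conj(X[k])."""
    return half + [half[N - k].conjugate() for k in range(N // 2 + 1, N)]
```

In hardware the same symmetry lets the final pipeline stage produce two outputs from one computation, which is the kind of reuse that cuts the number of complex operations.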