Matrix factorization on a hypercube multiprocessor
Geist, G.A.; Heath, M.T.
1985-08-01
This paper is concerned with parallel algorithms for matrix factorization on distributed-memory, message-passing multiprocessors, with special emphasis on the hypercube. Both Cholesky factorization of symmetric positive definite matrices and LU factorization of nonsymmetric matrices using partial pivoting are considered. The use of the resulting triangular factors to solve systems of linear equations by forward and back substitutions is also considered. Efficiencies of various parallel computational approaches are compared in terms of empirical results obtained on an Intel iPSC hypercube. 19 refs., 6 figs., 2 tabs
An efficient communication scheme for solving Sn equations on message-passing multiprocessors
Azmy, Y.Y.
1993-01-01
Early models of Intel's hypercube multiprocessors, e.g., the iPSC/1 and iPSC/2, were characterized by the high latency of message passing. This relatively weak dependence of the communication penalty on the size of messages, in contrast to its strong dependence on the number of messages, justified using the Fan-in Fan-out algorithm (which implements a minimum spanning tree path) to perform global operations, such as global sums, etc. Recent models of message-passing computers, such as the iPSC/860 and the Paragon, have been found to possess much smaller latency, thus forcing a reexamination of the issue of performance optimization with respect to communication schemes. Essentially, the Fan-in Fan-out scheme minimizes the number of nonsimultaneous messages sent but not the volume of data traffic across the network. Furthermore, if a global operation is performed in conjunction with the message passing, a large fraction of the attached nodes remains idle as the number of utilized processors is halved in each step of the process. On the other hand, the Recursive Halving scheme offers the smallest communication cost for global operations but has some drawbacks
Parallel simulated annealing algorithms for cell placement on hypercube multiprocessors
Banerjee, Prithviraj; Jones, Mark Howard; Sargent, Jeff S.
1990-01-01
Two parallel algorithms for standard cell placement using simulated annealing are developed to run on distributed-memory message-passing hypercube multiprocessors. The cells can be mapped in a two-dimensional area of a chip onto processors in an n-dimensional hypercube in two ways, such that both small and large cell exchange and displacement moves can be applied. The computation of the cost function in parallel among all the processors in the hypercube is described, along with a distributed data structure that needs to be stored in the hypercube to support the parallel cost evaluation. A novel tree broadcasting strategy is used extensively for updating cell locations in the parallel environment. A dynamic parallel annealing schedule estimates the errors due to interacting parallel moves and adapts the rate of synchronization automatically. Two novel approaches in controlling error in parallel algorithms are described: heuristic cell coloring and adaptive sequence control.
Parallel algorithms for geometric connected component labeling on a hypercube multiprocessor
Belkhale, K. P.; Banerjee, P.
1992-01-01
Different algorithms for the geometric connected component labeling (GCCL) problem are defined each of which involves d stages of message passing, for a d-dimensional hypercube. The major idea is that in each stage a hypercube multiprocessor increases its knowledge of domain. The algorithms under consideration include the QUAD algorithm for small number of processors and the Overlap Quad algorithm for large number of processors, subject to the locality of the connected sets. These algorithms differ in their run time, memory requirements, and message complexity. They were implemented on an Intel iPSC2/D4/MX hypercube.
A parallel implementation of 3-d CT image reconstruction on a hypercube multiprocessor
Chen, C.M.; Lee, S.Y.; Cho, Z.H.
1990-01-01
In this paper, the authors describe how image reconstruction in computerized tomography (CT) can be parallelized on a message-passing multiprocessor. In particular, the results obtained from parallel implementation of 3-D CT image reconstruction for parallel beam geometries on the Intel hypercube, iPSC/2, are presented. A two stage pipelining approach is employed for filtering (convolution) and backprojection. The conventional sequential convolution algorithm is modified such that the symmetry of the filter kernel is fully utilized for parallelization. In the backprojection stage, the 3-D incremental algorithm, the authors' recently developed backprojection scheme which is shown to be faster than conventional algorithm, is parallelized
Azmy, Y.Y.
1997-01-01
The effect of three communication schemes for solving Arbitrarily High Order Transport (AHOT) methods of the Nodal type on parallel performance is examined via direct measurements and performance models. The target architecture in this study is Oak Ridge National Laboratory's 128 node Paragon XP/S 5 computer and the parallelization is based on the Parallel Virtual Machine (PVM) library. However, the conclusions reached can be easily generalized to a large class of message passing platforms and communication software. The three schemes considered here are: (1) PVM's global operations (broadcast and reduce) which utilizes the Paragon's native corresponding operations based on a spanning tree routing; (2) the Bucket algorithm wherein the angular domain decomposition of the mesh sweep is complemented with a spatial domain decomposition of the accumulation process of the scalar flux from the angular flux and the convergence test; (3) a distributed memory version of the Bucket algorithm that pushes the spatial domain decomposition one step farther by actually distributing the fixed source and flux iterates over the memories of the participating processes. Their conclusion is that the Bucket algorithm is the most efficient of the three if all participating processes have sufficient memories to hold the entire problem arrays. Otherwise, the third scheme becomes necessary at an additional cost to speedup and parallel efficiency that is quantifiable via the parallel performance model
Sargent, Jeff Scott
1988-01-01
A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel new approaches to controlling error in parallel cell-placement algorithms; Heuristic Cell-Coloring and Adaptive (Parallel Move) Sequence Control. Heuristic Cell-Coloring identifies sets of noninteracting cells that can be moved repeatedly, and in parallel, with no buildup of error in the placement cost. Adaptive Sequence Control allows multiple parallel cell moves to take place between global cell-position updates. This feedback mechanism is based on an error bound derived analytically from the traditional annealing move-acceptance profile. Placement results are presented for real industry circuits and the performance is summarized of an implementation on the Intel iPSC/2 Hypercube. The runtime of this algorithm is 5 to 16 times faster than a previous program developed for the Hypercube, while producing equivalent quality placement. An integrated place and route program for the Intel iPSC/2 Hypercube is currently being developed.
A parallel algorithm for switch-level timing simulation on a hypercube multiprocessor
Rao, Hariprasad Nannapaneni
1989-01-01
The parallel approach to speeding up simulation is studied, specifically the simulation of digital LSI MOS circuitry on the Intel iPSC/2 hypercube. The simulation algorithm is based on RSIM, an event driven switch-level simulator that incorporates a linear transistor model for simulating digital MOS circuits. Parallel processing techniques based on the concepts of Virtual Time and rollback are utilized so that portions of the circuit may be simulated on separate processors, in parallel for as large an increase in speed as possible. A partitioning algorithm is also developed in order to subdivide the circuit for parallel processing.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor
Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.; Schaefer, H.F. III
1987-01-01
For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. The usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation
Message passing for quantified Boolean formulas
Zhang, Pan; Ramezanpour, Abolfazl; Zecchina, Riccardo; Zdeborová, Lenka
2012-01-01
We introduce two types of message passing algorithms for quantified Boolean formulas (QBF). The first type is a message passing based heuristics that can prove unsatisfiability of the QBF by assigning the universal variables in such a way that the remaining formula is unsatisfiable. In the second type, we use message passing to guide branching heuristics of a Davis–Putnam–Logemann–Loveland (DPLL) complete solver. Numerical experiments show that on random QBFs our branching heuristics give robust exponential efficiency gain with respect to state-of-the-art solvers. We also manage to solve some previously unsolved benchmarks from the QBFLIB library. Apart from this, our study sheds light on using message passing in small systems and as subroutines in complete solvers
Message Passing Framework for Globally Interconnected Clusters
Hafeez, M; Riaz, N; Asghar, S; Malik, U A; Rehman, A
2011-01-01
In prevailing technology trends it is apparent that the network requirements and technologies will advance in future. Therefore the need of High Performance Computing (HPC) based implementation for interconnecting clusters is comprehensible for scalability of clusters. Grid computing provides global infrastructure of interconnecting clusters consisting of dispersed computing resources over Internet. On the other hand the leading model for HPC programming is Message Passing Interface (MPI). As compared to Grid computing, MPI is better suited for solving most of the complex computational problems. MPI itself is restricted to a single cluster. It does not support message passing over the internet to use the computing resources of different clusters in an optimal way. We propose a model that provides message passing capabilities between parallel applications over the internet. The proposed model is based on Architecture for Java Universal Message Passing (A-JUMP) framework and Enterprise Service Bus (ESB) named as High Performance Computing Bus. The HPC Bus is built using ActiveMQ. HPC Bus is responsible for communication and message passing in an asynchronous manner. Asynchronous mode of communication offers an assurance for message delivery as well as a fault tolerance mechanism for message passing. The idea presented in this paper effectively utilizes wide-area intercluster networks. It also provides scheduling, dynamic resource discovery and allocation, and sub-clustering of resources for different jobs. Performance analysis and comparison study of the proposed framework with P2P-MPI are also presented in this paper.
Message passing with parallel queue traversal
Underwood, Keith D [Albuquerque, NM; Brightwell, Ronald B [Albuquerque, NM; Hemmert, K Scott [Albuquerque, NM
2012-05-01
In message passing implementations, associative matching structures are used to permit list entries to be searched in parallel fashion, thereby avoiding the delay of linear list traversal. List management capabilities are provided to support list entry turnover semantics and priority ordering semantics.
Blind sensor calibration using approximate message passing
Schülke, Christophe; Caltagirone, Francesco; Zdeborová, Lenka
2015-01-01
The ubiquity of approximately sparse data has led a variety of communities to take great interest in compressed sensing algorithms. Although these are very successful and well understood for linear measurements with additive noise, applying them to real data can be problematic if imperfect sensing devices introduce deviations from this ideal signal acquisition process, caused by sensor decalibration or failure. We propose a message passing algorithm called calibration approximate message passing (Cal-AMP) that can treat a variety of such sensor-induced imperfections. In addition to deriving the general form of the algorithm, we numerically investigate two particular settings. In the first, a fraction of the sensors is faulty, giving readings unrelated to the signal. In the second, sensors are decalibrated and each one introduces a different multiplicative gain to the measurements. Cal-AMP shares the scalability of approximate message passing, allowing us to treat large sized instances of these problems, and experimentally exhibits a phase transition between domains of success and failure. (paper)
TUMULT, the Twente University multiprocessor
Scholten, Johan; Jansen, P.G.
1988-01-01
TUMULT, (Twente University multiprocessor) is described. Its aim is the design and implementation of a modular extendable multiprocessor system. Up to 15 processing elements are connected through an interprocessor communication network, using message-passing for the exchange of data. The hardware is
Future-based Static Analysis of Message Passing Programs
Wytse Oortwijn
2016-06-01
Full Text Available Message passing is widely used in industry to develop programs consisting of several distributed communicating components. Developing functionally correct message passing software is very challenging due to the concurrent nature of message exchanges. Nonetheless, many safety-critical applications rely on the message passing paradigm, including air traffic control systems and emergency services, which makes proving their correctness crucial. We focus on the modular verification of MPI programs by statically verifying concrete Java code. We use separation logic to reason about local correctness and define abstractions of the communication protocol in the process algebra used by mCRL2. We call these abstractions futures as they predict how components will interact during program execution. We establish a provable link between futures and program code and analyse the abstract futures via model checking to prove global correctness. Finally, we verify a leader election protocol to demonstrate our approach.
The serial message-passing schedule for LDPC decoding algorithms
Liu, Mingshan; Liu, Shanshan; Zhou, Yuan; Jiang, Xue
2015-12-01
The conventional message-passing schedule for LDPC decoding algorithms is the so-called flooding schedule. It has the disadvantage that the updated messages cannot be used until next iteration, thus reducing the convergence speed . In this case, the Layered Decoding algorithm (LBP) based on serial message-passing schedule is proposed. In this paper the decoding principle of LBP algorithm is briefly introduced, and then proposed its two improved algorithms, the grouped serial decoding algorithm (Grouped LBP) and the semi-serial decoding algorithm .They can improve LBP algorithm's decoding speed while maintaining a good decoding performance.
EXIT Chart Analysis of Binary Message-Passing Decoders
Lechner, Gottfried; Pedersen, Troels; Kramer, Gerhard
2007-01-01
Binary message-passing decoders for LDPC codes are analyzed using EXIT charts. For the analysis, the variable node decoder performs all computations in the L-value domain. For the special case of a hard decision channel, this leads to the well know Gallager B algorithm, while the analysis can...... be extended to channels with larger output alphabets. By increasing the output alphabet from hard decisions to four symbols, a gain of more than 1.0 dB is achieved using optimized codes. For this code optimization, the mixing property of EXIT functions has to be modified to the case of binary message......-passing decoders....
Protocol-Based Verification of Message-Passing Parallel Programs
López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco
2015-01-01
We present ParTypes, a type-based methodology for the verification of Message Passing Interface (MPI) programs written in the C programming language. The aim is to statically verify programs against protocol specifications, enforcing properties such as fidelity and absence of deadlocks. We develo...
Track-stitching using graphical models and message passing
Van der Merwe, LJ
2013-07-01
Full Text Available In order to stitch tracks together, two tasks are required, namely tracking and track stitching. In this study track stitching is performed using a graphical model and message passing (belief propagation) approach. Tracks are modelled as nodes in a...
A message passing algorithm for the evaluation of social influence
Vassio, Luca; Fagnani, Fabio; Frasca, Paolo; Ozdaglar, Asuman
2014-01-01
In this paper, we define a new measure of node centrality in social networks, the Harmonic Influence Centrality, which emerges naturally in the study of social influence over networks. Next, we introduce a distributed message passing algorithm to compute the Harmonic Influence Centrality of each
Analysis and Design of Binary Message-Passing Decoders
Lechner, Gottfried; Pedersen, Troels; Kramer, Gerhard
2012-01-01
Binary message-passing decoders for low-density parity-check (LDPC) codes are studied by using extrinsic information transfer (EXIT) charts. The channel delivers hard or soft decisions and the variable node decoder performs all computations in the L-value domain. A hard decision channel results...... message-passing decoders. Finally, it is shown that errors on cycles consisting only of degree two and three variable nodes cannot be corrected and a necessary and sufficient condition for the existence of a cycle-free subgraph is derived....... in the well-know Gallager B algorithm, and increasing the output alphabet from hard decisions to two bits yields a gain of more than 1.0 dB in the required signal to noise ratio when using optimized codes. The code optimization requires adapting the mixing property of EXIT functions to the case of binary...
S-AMP: Approximate Message Passing for General Matrix Ensembles
Cakmak, Burak; Winther, Ole; Fleury, Bernard H.
2014-01-01
the approximate message-passing (AMP) algorithm to general matrix ensembles with a well-defined large system size limit. The generalization is based on the S-transform (in free probability) of the spectrum of the measurement matrix. Furthermore, we show that the optimality of S-AMP follows directly from its......We propose a novel iterative estimation algorithm for linear observation models called S-AMP. The fixed points of S-AMP are the stationary points of the exact Gibbs free energy under a set of (first- and second-) moment consistency constraints in the large system limit. S-AMP extends...
Fault-tolerant Agreement in Synchronous Message-passing Systems
Raynal, Michel
2010-01-01
The present book focuses on the way to cope with the uncertainty created by process failures (crash, omission failures and Byzantine behavior) in synchronous message-passing systems (i.e., systems whose progress is governed by the passage of time). To that end, the book considers fundamental problems that distributed synchronous processes have to solve. These fundamental problems concern agreement among processes (if processes are unable to agree in one way or another in presence of failures, no non-trivial problem can be solved). They are consensus, interactive consistency, k-set agreement an
Parallelization of a hydrological model using the message passing interface
Wu, Yiping; Li, Tiejian; Sun, Liqun; Chen, Ji
2013-01-01
With the increasing knowledge about the natural processes, hydrological models such as the Soil and Water Assessment Tool (SWAT) are becoming larger and more complex with increasing computation time. Additionally, other procedures such as model calibration, which may require thousands of model iterations, can increase running time and thus further reduce rapid modeling and analysis. Using the widely-applied SWAT as an example, this study demonstrates how to parallelize a serial hydrological model in a Windows® environment using a parallel programing technology—Message Passing Interface (MPI). With a case study, we derived the optimal values for the two parameters (the number of processes and the corresponding percentage of work to be distributed to the master process) of the parallel SWAT (P-SWAT) on an ordinary personal computer and a work station. Our study indicates that model execution time can be reduced by 42%–70% (or a speedup of 1.74–3.36) using multiple processes (two to five) with a proper task-distribution scheme (between the master and slave processes). Although the computation time cost becomes lower with an increasing number of processes (from two to five), this enhancement becomes less due to the accompanied increase in demand for message passing procedures between the master and all slave processes. Our case study demonstrates that the P-SWAT with a five-process run may reach the maximum speedup, and the performance can be quite stable (fairly independent of a project size). Overall, the P-SWAT can help reduce the computation time substantially for an individual model run, manual and automatic calibration procedures, and optimization of best management practices. In particular, the parallelization method we used and the scheme for deriving the optimal parameters in this study can be valuable and easily applied to other hydrological or environmental models.
Weighted community detection and data clustering using message passing
Shi, Cheng; Liu, Yanchen; Zhang, Pan
2018-03-01
Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing algorithms and spectral algorithms proposed for an unweighted community detection problem, we develop a non-parametric method based on statistical physics, by mapping the problem to the Potts model at the critical temperature of spin-glass transition and applying belief propagation to solve the marginals corresponding to the Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method only needs several labels to work perfectly in classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which heavily reduce the computation complexity in dense networks but give almost the same performance as belief propagation.
Ha, Jeongmok; Jeong, Hong
2016-07-01
This study investigates the directed acyclic subgraph (DAS) algorithm, which is used to solve discrete labeling problems much more rapidly than other Markov-random-field-based inference methods but at a competitive accuracy. However, the mechanism by which the DAS algorithm simultaneously achieves competitive accuracy and fast execution speed, has not been elucidated by a theoretical derivation. We analyze the DAS algorithm by comparing it with a message passing algorithm. Graphical models, inference methods, and energy-minimization frameworks are compared between DAS and message passing algorithms. Moreover, the performances of DAS and other message passing methods [sum-product belief propagation (BP), max-product BP, and tree-reweighted message passing] are experimentally compared.
High performance message passing for the ATLAS DAQ/EF-1 project
Mornacchi, Giuseppe
1999-01-01
Summary form only. A message passing library has been developed in the context of the ATLAS DAQ/EF-1 project. It is used for time critical applications within the front-end part of the DAQ system, mainly to exchange data control messages between I/O processors. Key objectives of the design were low message overheads, efficient use of the data transfer buses, provision of broadcast functionality and a hardware and operating system independent implementation of the application interface. The design and implementation of the message passing library are presented. As required by the project, the implementation is based on commercial components, namely VMEbus, PCI, the Lynx-OS real-time operating system and an additional inter- processor link, PVIC. The latter offers broadcast functionality identified as being important to the overall performance of the message passing. In addition, performance benchmarks for all implementing buses are presented for both simple test programs and the full DAQ applications. (0 refs)...
Message-Passing Receivers for Single Carrier Systems with Frequency-Domain Equalization
Zhang, Chuanzong; Manchón, Carles Navarro; Wang, Zhongyong
2015-01-01
In this letter, we design iterative receiver algorithms for joint frequency-domain equalization and decoding in a single carrier system assuming perfect channel state information. Based on an approximate inference framework that combines belief propagation (BP) and the mean field (MF) approximation......, we propose two receiver algorithms with, respectively, parallel and sequential message-passing schedules in the MF part. A recently proposed receiver based on generalized approximate message passing (GAMP) is used as a benchmarking reference. The simulation results show that the BP-MF receiver...
McMPI – a managed-code message passing interface library for high performance communication in C#
Holmes, Daniel John
2012-01-01
This work endeavours to achieve technology transfer between established best-practice in academic high-performance computing and current techniques in commercial high-productivity computing. It shows that a credible high-performance message-passing communication library, with semantics and syntax following the Message-Passing Interface (MPI) Standard, can be built in pure C# (one of the .Net suite of computer languages). Message-passing has been the dominant paradigm in high-pe...
Stampi: a message passing library for distributed parallel computing. User's guide
Imamura, Toshiyuki; Koide, Hiroshi; Takemiya, Hiroshi
1998-11-01
A new message passing library, Stampi, has been developed to realize a computation with different kind of parallel computers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on MPI2 specification. It realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Vender implemented MPI as a closed system in one parallel machine and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallel computing. Currently Stampi has been implemented on COMPACS (COMplex PArallel Computer System) introduced in CCSE, five parallel computers and one graphic workstation, and any communication on them can be processed on. (author)
Neighbourhood-consensus message passing and its potentials in image processing applications
Ružic, Tijana; Pižurica, Aleksandra; Philips, Wilfried
2011-03-01
In this paper, a novel algorithm for inference in Markov Random Fields (MRFs) is presented. Its goal is to find approximate maximum a posteriori estimates in a simple manner by combining neighbourhood influence of iterated conditional modes (ICM) and message passing of loopy belief propagation (LBP). We call the proposed method neighbourhood-consensus message passing because a single joint message is sent from the specified neighbourhood to the central node. The message, as a function of beliefs, represents the agreement of all nodes within the neighbourhood regarding the labels of the central node. This way we are able to overcome the disadvantages of reference algorithms, ICM and LBP. On one hand, more information is propagated in comparison with ICM, while on the other hand, the huge amount of pairwise interactions is avoided in comparison with LBP by working with neighbourhoods. The idea is related to the previously developed iterated conditional expectations algorithm. Here we revisit it and redefine it in a message passing framework in a more general form. The results on three different benchmarks demonstrate that the proposed technique can perform well both for binary and multi-label MRFs without any limitations on the model definition. Furthermore, it manifests improved performance over related techniques either in terms of quality and/or speed.
A real-time MPEG software decoder using a portable message-passing library
Kwong, Man Kam; Tang, P.T. Peter; Lin, Biquan
1995-12-31
We present a real-time MPEG software decoder that uses message-passing libraries such as MPL, p4 and MPI. The parallel MPEG decoder currently runs on the IBM SP system but can be easil ported to other parallel machines. This paper discusses our parallel MPEG decoding algorithm as well as the parallel programming environment under which it uses. Several technical issues are discussed, including balancing of decoding speed, memory limitation, 1/0 capacities, and optimization of MPEG decoding components. This project shows that a real-time portable software MPEG decoder is feasible in a general-purpose parallel machine.
The specification of Stampi, a message passing library for distributed parallel computing
Imamura, Toshiyuki; Takemiya, Hiroshi; Koide, Hiroshi
2000-03-01
At CCSE, Center for Promotion of Computational Science and Engineering, a new message passing library for heterogeneous and distributed parallel computing has been developed, and it is called as Stampi. Stampi enables us to communicate between any combination of parallel computers as well as workstations. Currently, a Stampi system is constructed from Stampi library and Stampi/Java. It provides functions to connect a Stampi application with not only those on COMPACS, COMplex Parallel Computer System, but also applets which work on WWW browsers. This report summarizes the specifications of Stampi and details the development of its system. (author)
Stampi: a message passing library for distributed parallel computing. User's guide, second edition
Imamura, Toshiyuki; Koide, Hiroshi; Takemiya, Hiroshi
2000-02-01
A new message passing library, Stampi, has been developed to realize a computation with different kind of parallel computers arbitrarily and making MPI (Message Passing Interface) as an unique interface for communication. Stampi is based on the MPI2 specification, and it realizes dynamic process creation to different machines and communication between spawned one within the scope of MPI semantics. Main features of Stampi are summarized as follows: (i) an automatic switch function between external- and internal communications, (ii) a message routing/relaying with a routing module, (iii) a dynamic process creation, (iv) a support of two types of connection, Master/Slave and Client/Server, (v) a support of a communication with Java applets. Indeed vendors implemented MPI libraries as a closed system in one parallel machine or their systems, and did not support both functions; process creation and communication to external machines. Stampi supports both functions and enables us distributed parallel computing. Currently Stampi has been implemented on COMPACS (COMplex PArallel Computer System) introduced in CCSE, five parallel computers and one graphic workstation, moreover on eight kinds of parallel machines, totally fourteen systems. Stampi provides us MPI communication functionality on them. This report describes mainly the usage of Stampi. (author)
A message-passing approach to random constraint satisfaction problems with growing domains
Zhao, Chunyan; Zheng, Zhiming; Zhou, Haijun; Xu, Ke
2011-01-01
Message-passing algorithms based on belief propagation (BP) are implemented on a random constraint satisfaction problem (CSP) referred to as model RB, which is a prototype of hard random CSPs with growing domain size. In model RB, the number of candidate discrete values (the domain size) of each variable increases polynomially with the variable number N of the problem formula. Although the satisfiability threshold of model RB is exactly known, finding solutions for a single problem formula is quite challenging and attempts have been limited to cases of N ∼ 10 2 . In this paper, we propose two different kinds of message-passing algorithms guided by BP for this problem. Numerical simulations demonstrate that these algorithms allow us to find a solution for random formulas of model RB with constraint tightness slightly less than p cr , the threshold value for the satisfiability phase transition. To evaluate the performance of these algorithms, we also provide a local search algorithm (random walk) as a comparison. Besides this, the simulated time dependence of the problem size N and the entropy of the variables for growing domain size are discussed
A Message-Passing Hardware/Software Cosimulation Environment for Reconfigurable Computing Systems
Manuel Saldaña
2009-01-01
Full Text Available High-performance reconfigurable computers (HPRCs provide a mix of standard processors and FPGAs to collectively accelerate applications. This introduces new design challenges, such as the need for portable programming models across HPRCs and system-level verification tools. To address the need for cosimulating a complete heterogeneous application using both software and hardware in an HPRC, we have created a tool called the Message-passing Simulation Framework (MSF. We have used it to simulate and develop an interface enabling an MPI-based approach to exchange data between X86 processors and hardware engines inside FPGAs. The MSF can also be used as an application development tool that enables multiple FPGAs in simulation to exchange messages amongst themselves and with X86 processors. As an example, we simulate a LINPACK benchmark hardware core using an Intel-FSB-Xilinx-FPGA platform to quickly prototype the hardware, to test the communications. and to verify the benchmark results.
Partition function expansion on region graphs and message-passing equations
Zhou, Haijun; Wang, Chuang; Xiao, Jing-Qing; Bi, Zedong
2011-01-01
Disordered and frustrated graphical systems are ubiquitous in physics, biology, and information science. For models on complete graphs or random graphs, deep understanding has been achieved through the mean-field replica and cavity methods. But finite-dimensional 'real' systems remain very challenging because of the abundance of short loops and strong local correlations. A statistical mechanics theory is constructed in this paper for finite-dimensional models based on the mathematical framework of the partition function expansion and the concept of region graphs. Rigorous expressions for the free energy and grand free energy are derived. Message-passing equations on the region graph, such as belief propagation and survey propagation, are also derived rigorously. (letter)
Khoshfetrat Pakazad, Sina; Hansson, Anders; Andersen, Martin S.
2017-01-01
In this paper, we propose a distributed algorithm for solving coupled problems with chordal sparsity or an inherent tree structure which relies on primal–dual interior-point methods. We achieve this by distributing the computations at each iteration, using message-passing. In comparison to existi...
Deng Li; Xie Zhongsheng
1999-01-01
The coupled neutron and photon transport Monte Carlo code MCNP (version 3B) has been parallelized in parallel virtual machine (PVM) and message passing interface (MPI) by modifying a previous serial code. The new code has been verified by solving sample problems. The speedup increases linearly with the number of processors and the average efficiency is up to 99% for 12-processor. (author)
Design of a Message Passing Model for Use in a Heterogeneous CPU-NFP Framework for Network Analytics
Pennefather, S
2017-09-01
Full Text Available of applications written in the Go programming language to be executed on a Network Flow Processor (NFP) for enhanced performance. This paper explores the need and feasibility of implementing a message passing model for data transmission between the NFP and CPU...
Sakata, Ayaka; Xu, Yingying
2018-03-01
We analyse a linear regression problem with nonconvex regularization called smoothly clipped absolute deviation (SCAD) under an overcomplete Gaussian basis for Gaussian random data. We propose an approximate message passing (AMP) algorithm considering nonconvex regularization, namely SCAD-AMP, and analytically show that the stability condition corresponds to the de Almeida-Thouless condition in spin glass literature. Through asymptotic analysis, we show the correspondence between the density evolution of SCAD-AMP and the replica symmetric (RS) solution. Numerical experiments confirm that for a sufficiently large system size, SCAD-AMP achieves the optimal performance predicted by the replica method. Through replica analysis, a phase transition between replica symmetric and replica symmetry breaking (RSB) region is found in the parameter space of SCAD. The appearance of the RS region for a nonconvex penalty is a significant advantage that indicates the region of smooth landscape of the optimization problem. Furthermore, we analytically show that the statistical representation performance of the SCAD penalty is better than that of \
Using Partial Reconfiguration and Message Passing to Enable FPGA-Based Generic Computing Platforms
Manuel Saldaña
2012-01-01
Full Text Available Partial reconfiguration (PR is an FPGA feature that allows the modification of certain parts of an FPGA while the rest of the system continues to operate without disruption. This distinctive characteristic of FPGAs has many potential benefits but also challenges. The lack of good CAD tools and the deep hardware knowledge requirement result in a hard-to-use feature. In this paper, the new partition-based Xilinx PR flow is used to incorporate PR within our MPI-based message-passing framework to allow hardware designers to create template bitstreams, which are predesigned, prerouted, generic bitstreams that can be reused for multiple applications. As an example of the generality of this approach, four different applications that use the same template bitstream are run consecutively, with a PR operation performed at the beginning of each application to instantiate the desired application engine. We demonstrate a simplified, reusable, high-level, and portable PR interface for X86-FPGA hybrid machines. PR issues such as local resets of reconfigurable modules and context saving and restoring are addressed in this paper followed by some examples and preliminary PR overhead measurements.
Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems. Ph.D. Thesis
Wang, Yi-Min
1993-01-01
Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated.
Message passing vs. shared address space on a cluster of SMPs
Shan, Hongzhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak
2001-01-01
The emergence of scalable computer architectures using clusters of PCs or PC-SMPs with commodity networking has made them attractive platforms for high-end scientific computing. Currently, message passing (MP) and shared address space (SAS) are the two leading programming paradigms for these systems. MP has been standardized with MPI, and is the most common and mature parallel programming approach. However, MP code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, they compare the performance of and programming effort required for six applications under both programming models on a 32-CPU PC-SMP cluster. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of the applications; however, on certain classes of problems, SAS performance is competitive with MPI
CUBESIM, Hypercube and Denelcor Hep Parallel Computer Simulation
Dunigan, T.H.
1988-01-01
1 - Description of program or function: CUBESIM is a set of subroutine libraries and programs for the simulation of message-passing parallel computers and shared-memory parallel computers. Subroutines are supplied to simulate the Intel hypercube and the Denelcor HEP parallel computers. The system permits a user to develop and test parallel programs written in C or FORTRAN on a single processor. The user may alter such hypercube parameters as message startup times, packet size, and the computation-to-communication ratio. The simulation generates a trace file that can be used for debugging, performance analysis, or graphical display. 2 - Method of solution: The CUBESIM simulator is linked with the user's parallel application routines to run as a single UNIX process. The simulator library provides a small operating system to perform process and message management. 3 - Restrictions on the complexity of the problem: Up to 128 processors can be simulated with a virtual memory limit of 6 million bytes. Up to 1000 processes can be simulated
On the adequacy of message-passing parallel supercomputers for solving neutron transport problems
Azmy, Y.Y.
1990-01-01
A coarse-grained, static-scheduling parallelization of the standard iterative scheme used for solving the discrete-ordinates approximation of the neutron transport equation is described. The parallel algorithm is based on a decomposition of the angular domain along the discrete ordinates, thus naturally producing a set of completely uncoupled systems of equations in each iteration. Implementation of the parallel code on Intcl's iPSC/2 hypercube, and solutions to test problems are presented as evidence of the high speedup and efficiency of the parallel code. The performance of the parallel code on the iPSC/2 is analyzed, and a model for the CPU time as a function of the problem size (order of angular quadrature) and the number of participating processors is developed and validated against measured CPU times. The performance model is used to speculate on the potential of massively parallel computers for significantly speeding up real-life transport calculations at acceptable efficiencies. We conclude that parallel computers with a few hundred processors are capable of producing large speedups at very high efficiencies in very large three-dimensional problems. 10 refs., 8 figs
The design of multi-core DSP parallel model based on message passing and multi-level pipeline
Niu, Jingyu; Hu, Jian; He, Wenjing; Meng, Fanrong; Li, Chuanrong
2017-10-01
Currently, the design of embedded signal processing system is often based on a specific application, but this idea is not conducive to the rapid development of signal processing technology. In this paper, a parallel processing model architecture based on multi-core DSP platform is designed, and it is mainly suitable for the complex algorithms which are composed of different modules. This model combines the ideas of multi-level pipeline parallelism and message passing, and summarizes the advantages of the mainstream model of multi-core DSP (the Master-Slave model and the Data Flow model), so that it has better performance. This paper uses three-dimensional image generation algorithm to validate the efficiency of the proposed model by comparing with the effectiveness of the Master-Slave and the Data Flow model.
A. Averbuch
1994-01-01
Full Text Available Parallel elliptic single/multigrid solutions around an aligned and nonaligned body are presented and implemented on two multi-user and single-user shared memory multiprocessors (Sequent Symmetry and MOS and on a distributed memory multiprocessor (a Transputer network. Our parallel implementation uses the Virtual Machine for Muli-Processors (VMMP, a software package that provides a coherent set of services for explicitly parallel application programs running on diverse multiple instruction multiple data (MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. The performance of our algorithm is investigated in detail. It is seen to fit well the above architectures when the number of processors is less than the maximal number of grid points along the axes. In general, the efficiency in the nonaligned case is higher than in the aligned case. Alignment overhead is observed to be up to 200% in the shared-memory case and up to 65% in the message-passing case. We have demonstrated that when using VMMP, the portability of the algorithms is straightforward and efficient.
A survey of Tumult, a real-time multi-processor system
Jansen, P.G.
1986-01-01
Tumult (Twente University MULTi processor system) is the name of an ongoing project aiming at the design and implementation of a modular extendible multiprocessor system. All memory is distributed and processors communicate in parallel via a fast and reliable local switching network instead of a shared bus. A distributed real-time operating system is being designed and implemented, consisting of a multi-tasking subsystem per processor. Processes can communicate via a message passing mechanism. Communication links and processes are dynamically created and disposed by the application. In this article a brief description of the system is given; communication aspects are emphasized. (Auth.)
Hyperswitch Network For Hypercube Computer
Chow, Edward; Madan, Herbert; Peterson, John
1989-01-01
Data-driven dynamic switching enables high speed data transfer. Proposed hyperswitch network based on mixed static and dynamic topologies. Routing header modified in response to congestion or faults encountered as path established. Static topology meets requirement if nodes have switching elements that perform necessary routing header revisions dynamically. Hypercube topology now being implemented with switching element in each computer node aimed at designing very-richly-interconnected multicomputer system. Interconnection network connects great number of small computer nodes, using fixed hypercube topology, characterized by point-to-point links between nodes.
Multiprocessor data acquisition system
Haumann, J.R.; Crawford, R.K.
1987-01-01
A multiprocessor data acquisition system has been built to replace the single processor systems at the Intense Pulsed Neutron Source (IPNS) at Argonne National Laboratory. The multiprocessor system was needed to accommodate the higher data rates at IPNS brought about by improvements in the source and changes in instrument configurations. This paper describes the hardware configuration of the system and the method of task sharing and compares results to the single processor system
Ricci-Tersenghi, Federico; Zdeborova, Lenka; Zecchina, Riccardo; Tramel, Eric W; Cugliandolo, Leticia F
2015-01-01
This book contains a collection of the presentations that were given in October 2013 at the Les Houches Autumn School on statistical physics, optimization, inference, and message-passing algorithms. In the last decade, there has been increasing convergence of interest and methods between theoretical physics and fields as diverse as probability, machine learning, optimization, and inference problems. In particular, much theoretical and applied work in statistical physics and computer science has relied on the use of message-passing algorithms and their connection to the statistical physics of glasses and spin glasses. For example, both the replica and cavity methods have led to recent advances in compressed sensing, sparse estimation, and random constraint satisfaction, to name a few. This book’s detailed pedagogical lectures on statistical inference, computational complexity, the replica and cavity methods, and belief propagation are aimed particularly at PhD students, post-docs, and young researchers desir...
Altarelli, Fabrizio; Monasson, Remi; Zamponi, Francesco
2007-01-01
For large clause-to-variable ratios, typical K-SAT instances drawn from the uniform distribution have no solution. We argue, based on statistical mechanics calculations using the replica and cavity methods, that rare satisfiable instances from the uniform distribution are very similar to typical instances drawn from the so-called planted distribution, where instances are chosen uniformly between the ones that admit a given solution. It then follows, from a recent article by Feige, Mossel and Vilenchik (2006 Complete convergence of message passing algorithms for some satisfiability problems Proc. Random 2006 pp 339-50), that these rare instances can be easily recognized (in O(log N) time and with probability close to 1) by a simple message-passing algorithm
Multiprocessor programming environment
Smith, M.B.; Fornaro, R.
1988-12-01
Programming tools and techniques have been well developed for traditional uniprocessor computer systems. The focus of this research project is on the development of a programming environment for a high speed real time heterogeneous multiprocessor system, with special emphasis on languages and compilers. The new tools and techniques will allow a smooth transition for programmers with experience only on single processor systems.
The art of multiprocessor programming
Herlihy, Maurice
2012-01-01
Revised and updated with improvements conceived in parallel programming courses, The Art of Multiprocessor Programming is an authoritative guide to multicore programming. It introduces a higher level set of software development skills than that needed for efficient single-core programming. This book provides comprehensive coverage of the new principles, algorithms, and tools necessary for effective multiprocessor programming. Students and professionals alike will benefit from thorough coverage of key multiprocessor programming issues. This revised edition incorporates much-demanded updates t
Multiprocessor development for robot control
Lee, Jong Min; Kim, Byung Soo; Kim, Chang Hoi; Hwang, Suk Yong; Sohn, Surg Won; Yoon, Tae Seob; Lee, Yong Bum; Kim, Woong Ki
1988-02-01
A mutiprocessor system that is essential to A.I. (Artificial Intelligence) robot control was developed. A.I. robot control needs very complex real time control. The multiprocessor system interconnecting many SBC's (Single Board Computer) is much faster and accurater than using only one SBC. Various multiprocessor systems and their applications were compared and discussed. The multiprocessor architecture system is specially designed to be used in nuclear environments. The main functions are job distribution, multitasking, and intelligent remote control by SDLC protocol using optical fiber. The system can be applied to position control for locomotion and manipulation, data fusion system, and image processing. (Author)
Li, J; Guo, L-X; Zeng, H; Han, X-B
2009-06-01
A message-passing-interface (MPI)-based parallel finite-difference time-domain (FDTD) algorithm for the electromagnetic scattering from a 1-D randomly rough sea surface is presented. The uniaxial perfectly matched layer (UPML) medium is adopted for truncation of FDTD lattices, in which the finite-difference equations can be used for the total computation domain by properly choosing the uniaxial parameters. This makes the parallel FDTD algorithm easier to implement. The parallel performance with different processors is illustrated for one sea surface realization, and the computation time of the parallel FDTD algorithm is dramatically reduced compared to a single-process implementation. Finally, some numerical results are shown, including the backscattering characteristics of sea surface for different polarization and the bistatic scattering from a sea surface with large incident angle and large wind speed.
Multiprocessors for high energy physics
Pohl, M.
1987-01-01
I review the role, status and progress of multiprocessor projects relevant to high energy physics. A short overview of the large variety of multiprocessors architectures is given, with special emphasis on machines suitable for experimental data reconstruction. A lot of progress has been made in the attempt to make the use of multiprocessors less painful by creating a ''Parallel Programming Environment'' supporting the non-expert user. A high degree of usability has been reached for coarse grain (event level) parallelism. The program development tools available on various systems (subroutine packages, preprocessors and parallelizing compilers) are discussed in some detail. Tools for execution control and debugging are also developing, thus opening the path from dedicated systems for large scale, stable production towards support of a more general job mix. At medium term, multiprocessors will thus cover a growing fraction of the typical high energy physics computing task. (orig.)
Embedded multiprocessors scheduling and synchronization
Sriram, Sundararajan
2009-01-01
Techniques for Optimizing Multiprocessor Implementations of Signal Processing ApplicationsAn indispensable component of the information age, signal processing is embedded in a variety of consumer devices, including cell phones and digital television, as well as in communication infrastructure, such as media servers and cellular base stations. Multiple programmable processors, along with custom hardware running in parallel, are needed to achieve the computation throughput required of such applications. Reviews important research in key areas related to the multiprocessor implementation of multi
Latin hypercube sampling with inequality constraints
Iooss, B.; Petelet, M.; Asserin, O.; Loredo, A.
2010-01-01
In some studies requiring predictive and CPU-time consuming numerical models, the sampling design of the model input variables has to be chosen with caution. For this purpose, Latin hypercube sampling has a long history and has shown its robustness capabilities. In this paper we propose and discuss a new algorithm to build a Latin hypercube sample (LHS) taking into account inequality constraints between the sampled variables. This technique, called constrained Latin hypercube sampling (cLHS), consists in doing permutations on an initial LHS to honor the desired monotonic constraints. The relevance of this approach is shown on a real example concerning the numerical welding simulation, where the inequality constraints are caused by the physical decreasing of some material properties in function of the temperature. (authors)
Edge-Disjoint Fibonacci Trees in Hypercube
Indhumathi Raman
2014-01-01
Full Text Available The Fibonacci tree is a rooted binary tree whose number of vertices admit a recursive definition similar to the Fibonacci numbers. In this paper, we prove that a hypercube of dimension h admits two edge-disjoint Fibonacci trees of height h, two edge-disjoint Fibonacci trees of height h-2, two edge-disjoint Fibonacci trees of height h-4 and so on, as subgraphs. The result shows that an algorithm with Fibonacci trees as underlying data structure can be implemented concurrently on a hypercube network with no communication latency.
Lattice gauge theory on the hypercube
Apostolakis, J.; Baillie, C.; Ding, Hong-Qiang; Flower, J.
1988-01-01
Lattice gauge theory, an extremely computationally intensive problem, has been run successfully on hypercubes for a number of years. Herein we give a flavor of this work, discussing both the physics and the computing behind it. 19 refs., 5 figs., 27 tabs
Multiprocessor systems and their concurrency
Starke, P H
1984-01-01
A multiprocessor system can be considered as a collection of finite automata which communicate over channels or shared memory units. The behaviour of such a system can be described by a semilanguage. This approach allows to define a numerical measure for the concurrency of multiprocessor systems and of distributed systems. This measure is characterized algebraically and the reconfiguration problem asking for an algorithm to construct an l-processor system which is equivalent to a given n-processor system is solved in the paper. 6 references.
Multiprocessor development for robot control
Lee, Jong Min; Kim, Seung Ho; Hwang, Suk Yeoung; Sohn, Surg Won; Kim, Byung Soo; Kim, Chang Hoi; Lee, Yong Bum; Kim, Woong Ki
1988-12-01
The object of this project is to develop a multiprocessor system which is essential to robot technology. A multiprocessor system interconnecting many single board computer is much faster and flexible than a single processor. The developed multiprocessor will be used to control nuclear mobile robot, so a loosely coupled system is adopted as a robot controller. A total configuration of controller is divided into three main parts in related with its function. It is consisted of supervisory control part, functional control part, remote control part. The designed control system is to be expanded easily for further use with a modular architecture, so the functional independency within sub-systems can be obtained throughout the system structure. Electromagnetic interference affecting to the control system is minimized by using optical fiber as communication media between robot and control system. System performances is enhanced not only by using distributed architecture in hardware, but by adopting real-time, multi-tasking operating system in software. The iRMX86 OS is used and reconfigured for real-time, multi-tasking operation. RS-485 serial communication protocol is used between functional control part and remote control part. Since the developed multiprocessor control system is an essential and fundamental technology for artificial intelligent robot, the result of this project can be applied directly to nuclear mobile robot. (Author)
A Multiprocessor Operating System Simulator
Johnston, Gary M.; Campbell, Roy H.
1988-01-01
This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall semester of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT&T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows that of the 'Choices' family of operating systems for loosely- and tightly-coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.
Reproducibility in a multiprocessor system
Bellofatto, Ralph A; Chen, Dong; Coteus, Paul W; Eisley, Noel A; Gara, Alan; Gooding, Thomas M; Haring, Rudolf A; Heidelberger, Philip; Kopcsay, Gerard V; Liebsch, Thomas A; Ohmacht, Martin; Reed, Don D; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka
2013-11-26
Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed; a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deterministic chip interfaces, zero-impact communication with the system, precise stop of the system and a scan of the system state.
Single-chip serial channel enhances multi-processor systems
Millar, J.
1982-01-01
In this paper multiprocessor systems are described and explained. The impact that VLSI advancements are having on multiprocessor design is pointed out. The TMS 7041 single-chip microcomputer is described briefly, highlighting its multiprocessor communication capability. And finally, a typical multiprocessor system is shown, implementing the TMS 7041.
Introduction to parallel algorithms and architectures arrays, trees, hypercubes
Leighton, F Thomson
1991-01-01
Introduction to Parallel Algorithms and Architectures: Arrays Trees Hypercubes provides an introduction to the expanding field of parallel algorithms and architectures. This book focuses on parallel computation involving the most popular network architectures, namely, arrays, trees, hypercubes, and some closely related networks.Organized into three chapters, this book begins with an overview of the simplest architectures of arrays and trees. This text then presents the structures and relationships between the dominant network architectures, as well as the most efficient parallel algorithms for
Hypercat - Hypercube of Clumpy AGN Tori
Nikutta, Robert; Lopez-Rodriguez, Enrique; Ichikawa, Kohei; Levenson, Nancy; Packham, Christopher C.
2017-06-01
Dusty tori surrounding the central engines of Active Galactic Nuclei (AGN) are required by the Unification Paradigm, and are supported by many observations, e.g. variable nuclear absorber (sometimes Compton-thick) in X-rays, reverberation mapping in optical/UV, hot dust emission and SED shapes in NIR/MIR, molecular and cool-dust tori observed with ALMA in sub-mm.While models of AGN torus SEDs have been developed and utilized for a long time, the study of the resolved emission morphology (brightness maps) has so far been under-appreciated, presumably because resolved observations of the central parsec in AGN are only possible very recently. Currently, only NIR+MIR interferometry is capable of resolving the nuclear dust emission (but not of producing images, until MATISSE comes online). Furthermore, MIR interferometry has delivered also puzzling results, e.g. that in some resolved sources the light emanates preferentially from polar directions above the "torus" system, and not from the equatorial plane, where most of the dust is located.We are preparing the release of a panchromatic, fully interpolable hypercube of brightness maps and projected dust images for a large number of CLUMPY torus models (Nenkova+2008), that will help facilitate studies of resolved AGN emission and dust morphologies. Together with the cube we will release a comprehensive set of open-source tools (Python) that will enable researches to work efficiently with this large hypercube:* easy sub-cube selection + memory-mapping (mitigating the too-big-for-RAM problem)* multi-dim image interpolation (get an image at any wavelength & model parameter combination)* simulation of observations with telescopes (compute/provide + apply a PSF) and interferometers (get visibilities)* analyze images with respect to the power contained at all scales and orientations (via 2D steerable wavelets), addressing the seemingly puzzling results mentioned aboveA series of papers is in preparation, aiming at solving the
Multiprocessor scheduling for real-time systems
Baruah, Sanjoy; Buttazzo, Giorgio
2015-01-01
This book provides a comprehensive overview of both theoretical and pragmatic aspects of resource-allocation and scheduling in multiprocessor and multicore hard-real-time systems. The authors derive new, abstract models of real-time tasks that capture accurately the salient features of real application systems that are to be implemented on multiprocessor platforms, and identify rules for mapping application systems onto the most appropriate models. New run-time multiprocessor scheduling algorithms are presented, which are demonstrably better than those currently used, both in terms of run-time efficiency and tractability of off-line analysis. Readers will benefit from a new design and analysis framework for multiprocessor real-time systems, which will translate into a significantly enhanced ability to provide formally verified, safety-critical real-time systems at a significantly lower cost.
The structural robustness of multiprocessor computing system
N. Andronaty
1996-03-01
Full Text Available The model of the multiprocessor computing system on the base of transputers which permits to resolve the question of valuation of a structural robustness (viability, survivability is described.
Multiprocessor development for robot control
Lee, John Min; Kim, Seung Ho; Kim, Chang Hoi; Kim, Byung Soo; Hwang, Suk Yeong; Lee, Young Bum; Sohn, Suk Won; Kim, Woon Gi
1990-01-01
The project of this study is to develop a real time controller applying autonomous robotic systems operated in hostile environment. Developed control system is designed with a multiprocessor to get independency and reliability as well as to extend the system easily. The control system is designed in three distinct subsystems (supervisory control part, functional control part, and remote control part). To review the functional performance of developed controller, a prototype mobile robot, which was installed 4 DOF mainpulator, was designed and manufactured. Initial tests showed that the robot could turn with a radius of 38 cm and a maximum speed of 1.26 km/hr and go over obstacle of 18 cm in height. (author)
Multiprocessor performance modeling with ADAS
Hayes, Paul J.; Andrews, Asa M.
1989-01-01
A graph managing strategy referred to as the Algorithm to Architecture Mapping Model (ATAMM) appears useful for the time-optimized execution of application algorithm graphs in embedded multiprocessors and for the performance prediction of graph designs. This paper reports the modeling of ATAMM in the Architecture Design and Assessment System (ADAS) to make an independent verification of ATAMM's performance prediction capability and to provide a user framework for the evaluation of arbitrary algorithm graphs. Following an overview of ATAMM and its major functional rules are descriptions of the ADAS model of ATAMM, methods to enter an arbitrary graph into the model, and techniques to analyze the simulation results. The performance of a 7-node graph example is evaluated using the ADAS model and verifies the ATAMM concept by substantiating previously published performance results.
Multiprocessor architecture: Synthesis and evaluation
Standley, Hilda M.
1990-01-01
Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Realtime multiprocessor for mobile ad hoc networks
T. Jungeblut
2008-05-01
Full Text Available This paper introduces a real-time Multiprocessor System-On-Chip (MPSoC for low power wireless applications. The multiprocessor is based on eight 32bit RISC processors that are connected via an Network-On-Chip (NoC. The NoC follows a novel approach with guaranteed bandwidth to the application that meets hard realtime requirements. At a clock frequency of 100 MHz the total power consumption of the MPSoC that has been fabricated in 180 nm UMC standard cell technology is 772 mW.
A class of integrals on squares, cubes and hypercubes
McCartney, Mark
2014-04-01
In this note, a class of integrals on the unit hypercube are considered and series solutions presented. The material is at a level appropriate for use with undergraduates, either as part of a course on multivariable calculus, or as a short independent research project. Suitable exercises for use in these contexts are given.
Communication overhead on the Intel iPSC-860 hypercube
Bokhari, Shahid H.
1990-01-01
Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.
Embedding complete ternary tree in hypercubes using AVL trees
S.A. Choudum; I. Raman (Indhumathi)
2008-01-01
htmlabstractA complete ternary tree is a tree in which every non-leaf vertex has exactly three children. We prove that a complete ternary tree of height h, TTh, is embeddable in a hypercube of dimension . This result coincides with the result of [2]. However, in this paper, the embedding utilizes
Graph Modeling for Quadratic Assignment Problems Associated with the Hypercube
Mittelmann, Hans; Peng Jiming; Wu Xiaolin
2009-01-01
In the paper we consider the quadratic assignment problem arising from channel coding in communications where one coefficient matrix is the adjacency matrix of a hypercube in a finite dimensional space. By using the geometric structure of the hypercube, we first show that there exist at least n different optimal solutions to the underlying QAPs. Moreover, the inherent symmetries in the associated hypercube allow us to obtain partial information regarding the optimal solutions and thus shrink the search space and improve all the existing QAP solvers for the underlying QAPs.Secondly, we use graph modeling technique to derive a new integer linear program (ILP) models for the underlying QAPs. The new ILP model has n(n-1) binary variables and O(n 3 log(n)) linear constraints. This yields the smallest known number of binary variables for the ILP reformulation of QAPs. Various relaxations of the new ILP model are obtained based on the graphical characterization of the hypercube, and the lower bounds provided by the LP relaxations of the new model are analyzed and compared with what provided by several classical LP relaxations of QAPs in the literature.
Divergence-Free Wavelets on the Hypercube : General Boundary Conditions
Stevenson, R.
2016-01-01
On the n-dimensional hypercube, for given k∈N, wavelet Riesz bases are constructed for the subspace of divergence-free vector fields of the Sobolev space Hk((0,1)n)n with general homogeneous Dirichlet boundary conditions, including slip or no-slip boundary conditions. Both primal and suitable dual
Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.
1989-01-01
Several techniques to perform static and dynamic load balancing techniques for vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.
One-Step Programmable Arbiters for Multiprocessors
Højberg, Kristian Søe
1978-01-01
When processors in a multiprocessor system demand service from a shared bus in an asynchronous mode, a synchronous state arbiter resolves conflicts and allocates resources. Independent of the combination of requests, only one state transition is required from a free to allocated resource...
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G.; Salapura, Valentina
2012-07-24
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
The fast Amsterdam multiprocessor (FAMP) system hardware
Hertzberger, L.O.; Kieft, G.; Kisielewski, B.; Wiggers, L.W.; Engster, C.; Koningsveld, L. van
1981-01-01
The architecture of a multiprocessor system is described that will be used for on-line filter and second stage trigger applications. The system is based on the MC 68000 microprocessor from Motorola. Emphasis is paid to hardware aspects, in particular the modularity, processor communication and interfacing, whereas the system software and the applications will be described in separate articles. (orig.)
Academic training: Advanced lectures on multiprocessor programming
PH Department
2011-01-01
Academic Training Lecture - Regular Programme 31 October 1, 2 November 2011 from 11:00 to 12:00 - IT Auditorium, Bldg. 31 Three classes (60 mins) on Multiprocessor Programming Prof. Dr. Christoph von Praun Georg-Simon-Ohm University of Applied Sciences Nuremberg, Germany This is an advanced class on multiprocessor programming. The class gives an introduction to principles of concurrent objects and the notion of different progress guarantees that concurrent computations can have. The focus of this class is on non-blocking computations, i.e. concurrent programs that do not make use of locks. We discuss the implementation of practical non-blocking data structures in detail. 1st class: Introduction to concurrent objects 2nd class: Principles of non-blocking synchronization 3rd class: Concurrent queues Brief Bio of Christoph von Praun Christoph worked on a variety of analysis techniques and runtime platforms for parallel programs. Hist most recent research studies programming models an...
Debugging in a multi-processor environment
Spann, J.M.
1981-01-01
The Supervisory Control and Diagnostic System (SCDS) for the Mirror Fusion Test Facility (MFTF) consists of nine 32-bit minicomputers arranged in a tightly coupled distributed computer system utilizing a share memory as the data exchange medium. Debugging of more than one program in the multi-processor environment is a difficult process. This paper describes what new tools were developed and how the testing of software is performed in the SCDS for the MFTF project
A user`s guide to LHS: Sandia`s Latin Hypercube Sampling Software
Wyss, G.D.; Jorgensen, K.H. [Sandia National Labs., Albuquerque, NM (United States). Risk Assessment and Systems Modeling Dept.
1998-02-01
This document is a reference guide for LHS, Sandia`s Latin Hypercube Sampling Software. This software has been developed to generate either Latin hypercube or random multivariate samples. The Latin hypercube technique employs a constrained sampling scheme, whereas random sampling corresponds to a simple Monte Carlo technique. The present program replaces the previous Latin hypercube sampling program developed at Sandia National Laboratories (SAND83-2365). This manual covers the theory behind stratified sampling as well as use of the LHS code both with the Windows graphical user interface and in the stand-alone mode.
Modeling and Analyzing Real-Time Multiprocessor Systems
Wiggers, M.H.; Thiele, Lothar; Lee, Edward A.; Schlieker, Simon; Bekooij, Marco Jan Gerrit
2010-01-01
Researchers have proposed approaches to verify that real-time multiprocessor systems meet their timeliness constraints. These approaches make assumptions on the model of computation, the load placed on the multiprocessor system, and the faults that can arise. This heterogeneous set of assumptions
Nemati, F.; Behnam, M.; Bril, R.J.
2009-01-01
In the multi-core and multiprocessor domain, there has been considerable work done on scheduling techniques assuming that real-time tasks are independent. In practice a typical real-time system usually share logical resources among tasks. However, synchronization in the multiprocessor area has not
Extension of Latin hypercube samples with correlated variables
Sallaberry, C.J. [Sandia National Laboratories, Department 6784, MS 0776, Albuquerque, NM 87185-0776 (United States); Helton, J.C. [Department of Mathematics and Statistics, Arizona State University, Tempe, AZ 85287-1804 (United States)], E-mail: jchelto@sandia.gov; Hora, S.C. [University of Hawaii at Hilo, Hilo, HI 96720-4091 (United States)
2008-07-15
A procedure for extending the size of a Latin hypercube sample (LHS) with rank correlated variables is described and illustrated. The extension procedure starts with an LHS of size m and associated rank correlation matrix C and constructs a new LHS of size 2m that contains the elements of the original LHS and has a rank correlation matrix that is close to the original rank correlation matrix C. The procedure is intended for use in conjunction with uncertainty and sensitivity analysis of computationally demanding models in which it is important to make efficient use of a necessarily limited number of model evaluations.
Extension of Latin hypercube samples with correlated variables
Sallaberry, C.J.; Helton, J.C.; Hora, S.C.
2008-01-01
A procedure for extending the size of a Latin hypercube sample (LHS) with rank correlated variables is described and illustrated. The extension procedure starts with an LHS of size m and associated rank correlation matrix C and constructs a new LHS of size 2m that contains the elements of the original LHS and has a rank correlation matrix that is close to the original rank correlation matrix C. The procedure is intended for use in conjunction with uncertainty and sensitivity analysis of computationally demanding models in which it is important to make efficient use of a necessarily limited number of model evaluations
Potts ferromagnet correlation length in hypercubic lattices: Renormalization - group approach
Curado, E.M.F.; Hauser, P.R.
1984-01-01
Through a real space renormalization group approach, the q-state Potts ferromagnet correlation length on hierarchical lattices is calculated. These hierarchical lattices are build in order to simulate hypercubic lattices. The high-and-low temperature correlation length asymptotic behaviours tend (in the Ising case) to the Bravais lattice correlation length ones when the size of the hierarchical lattice cells tends to infinity. It is conjectured that the asymptotic behaviours several values of q and d (dimensionality) so obtained are correct. Numerical results are obtained for the full temperature range of the correlation length. (Author) [pt
A spherical x-ray transform and hypercube sections
Kazantsev, Ivan G; Schmidt, Soren
2014-01-01
We investigate the problem of sampling a unit great circle on the unit sphere S-3 as a support of orientation distribution functions on which acts the discrete spherical x-ray transform. The circle's partition subsets are gnomonically mapped onto lines that constitute a convex polygon inside...... the bounding cubes of hypercube. Thus the problem of the great circle tracing is reduced to the problem of the four-dimensional cube sectioning by the plane containing the circle and the intersection figure (the polygon) vertices finding. In this paper, a fast, non-combinatorial approach for the polygon...
Extension of latin hypercube samples with correlated variables.
Hora, Stephen Curtis (University of Hawaii at Hilo, HI); Helton, Jon Craig (Arizona State University, Tempe, AZ); Sallaberry, Cedric J. PhD. (.; .)
2006-11-01
A procedure for extending the size of a Latin hypercube sample (LHS) with rank correlated variables is described and illustrated. The extension procedure starts with an LHS of size m and associated rank correlation matrix C and constructs a new LHS of size 2m that contains the elements of the original LHS and has a rank correlation matrix that is close to the original rank correlation matrix C. The procedure is intended for use in conjunction with uncertainty and sensitivity analysis of computationally demanding models in which it is important to make efficient use of a necessarily limited number of model evaluations.
Distributed parallel messaging for multiprocessor systems
Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka
2013-06-04
A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.
Hypercube algorithms suitable for image understanding in uncertain environments
Huntsberger, T.L.; Sengupta, A.
1988-01-01
Computer vision in a dynamic environment needs to be fast and able to tolerate incomplete or uncertain intermediate results. An appropriately chose representation coupled with a parallel architecture addresses both concerns. The wide range of numerical and symbolic processing needed for robust computer vision can only be achieved through a blend of SIMD and MIMD processing techniques. The 1024 element hypercube architecture has these capabilities, and was chosen as the test-bed hardware for development of highly parallel computer vision algorithms. This paper presents and analyzes parallel algorithms for color image segmentation and edge detection. These algorithms are part of a recently developed computer vision system which uses multiple valued logic to represent uncertainty in the imaging process and in intermediate results. Algorithms for the extraction of three dimensional properties of objects using dynamic scene analysis techniques within the same framework are examined. Results from experimental studies using a 1024 element hypercube implementation of the algorithm as applied to a series of natural scenes are reported
Chen, C.M.; Lee, S.Y.
1995-01-01
The EM algorithm promises an estimated image with the maximal likelihood for 3D PET image reconstruction. However, due to its long computation time, the EM algorithm has not been widely used in practice. While several parallel implementations of the EM algorithm have been developed to make the EM algorithm feasible, they do not guarantee an optimal parallelization efficiency. In this paper, the authors propose a new parallel EM algorithm which maximizes the performance by optimizing data replication on a mesh-connected message-passing multiprocessor. To optimize data replication, the authors have formally derived the optimal allocation of shared data, group sizes, integration and broadcasting of replicated data as well as the scheduling of shared data accesses. The proposed parallel EM algorithm has been implemented on an iPSC/860 with 16 PEs. The experimental and theoretical results, which are consistent with each other, have shown that the proposed parallel EM algorithm could improve performance substantially over those using unoptimized data replication
Multigrid solution of diffusion equations on distributed memory multiprocessor systems
Finnemann, H.
1988-01-01
The subject is the solution of partial differential equations for simulation of the reactor core on high-performance computers. The parallelization and implementation of nodal multigrid diffusion algorithms on array and ring configurations of the DIRMU multiprocessor system is outlined. The particular iteration scheme employed in the nodal expansion method appears similarly efficient in serial and parallel environments. The combination of modern multi-level techniques with innovative hardware (vector-multiprocessor systems) provides powerful tools needed for real time simulation of physical systems. The parallel efficiencies range from 70 to 90%. The same performance is estimated for large problems on large multiprocessor systems being designed at present. (orig.) [de
Two General Extension Algorithms of Latin Hypercube Sampling
Zhi-zhao Liu
2015-01-01
Full Text Available For reserving original sampling points to reduce the simulation runs, two general extension algorithms of Latin Hypercube Sampling (LHS are proposed. The extension algorithms start with an original LHS of size m and construct a new LHS of size m+n that contains the original points as many as possible. In order to get a strict LHS of larger size, some original points might be deleted. The relationship of original sampling points in the new LHS structure is shown by a simple undirected acyclic graph. The basic general extension algorithm is proposed to reserve the most original points, but it costs too much time. Therefore, a general extension algorithm based on greedy algorithm is proposed to reduce the extension time, which cannot guarantee to contain the most original points. These algorithms are illustrated by an example and applied to evaluating the sample means to demonstrate the effectiveness.
Numerical implementation of the Dirac equation on hypercube multicomputers
Wells, J.C.
1991-01-01
Motivated by an interest in nonperturbative electromagnetic lepton-pair production in relativistic heavy-ion collisions, we discuss the numerical methods used in implementing a lattice solution of the time-dependent Dirac equation in three-dimensional Cartesian coordinates. Discretization is obtained using the lattice basis-spline collocation method, in which quantum-state vectors and coordinate-space operators are expressed in terms of basis-spline functions, and represented on a spatial lattice. All numerical procedures reduce to a series of matrix-vector operations which we perform on the Intel iPSC/860 hypercube multicomputer. We discuss solutions to the problems of limited node memory and node-to-node communication overhead inherent in using distributed-memory, multiple-instruction, multiple-data parallel computers
Efficient process migration in the EMPS multiprocessor system
van Dijk, G.J.W.; Gils, van M.J.
1992-01-01
The process migration facility in the Eindhoven multiprocessor system (EMPS) is presented. In the EMPS system, mailboxes are used for interprocess communication. These mailboxes provide transparency of location for communicating processes. The major advantages of mailbox communication in the EMPS
Multiprocessor shared-memory information exchange
Santoline, L.L.; Bowers, M.D.; Crew, A.W.; Roslund, C.J.; Ghrist, W.D. III
1989-01-01
In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, is designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange
Multiprocessor Priority Ceiling Emulation for Safety-Critical Java
Strøm, Torur Biskopstø; Schoeberl, Martin
2015-01-01
Priority ceiling emulation has preferable properties on uniprocessor systems, such as avoiding priority inversion and being deadlock free. This has made it a popular locking protocol. According to the safety-critical Java specication, priority ceiling emulation is a requirement for implementations....... However, implementing the protocol for multiprocessor systemsis more complex so implementations might perform worse than non-preemptive implementations. In this paper we compare two multiprocessor lock implementations with hardware support for the Java optimized processor: non-preemptive locking...
Hardware locks for a real-time Java chip multiprocessor
Strøm, Torur Biskopstø; Puffitsch, Wolfgang; Schoeberl, Martin
2016-01-01
A software locking mechanism commonly protects shared resources for multithreaded applications. This mechanism can, especially in chip-multiprocessor systems, result in a large synchronization overhead. For real-time systems in particular, this overhead increases the worst-case execution time....... This improvement can allow a larger number of real-time tasks to be reliably scheduled on a multiprocessor real-time platform....
Multiprocessor Global Scheduling on Frame-Based DVFS Systems
Berten, Vandy; Goossens, Joël
2008-01-01
International audience; In this work, we are interested in multiprocessor energy efficient systems where task durations are not known in advance but are known stochastically. More precisely we consider global scheduling algorithms for frame-based multiprocessor stochastic DVFS (Dynamic Voltage and Frequency Scaling) systems. Moreover we consider processors with a discrete set of available frequencies. We provide a global scheduling algorithm, and formally show that no deadline will ever be mi...
File-System Workload on a Scientific Multiprocessor
Kotz, David; Nieuwejaar, Nils
1995-01-01
Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.
A Hypercube of Deep Sea Drilling Project (DSDP) Marine Geological and Geophysical Data
National Oceanic and Atmospheric Administration, Department of Commerce — Dr. Christopher Jenkins of the University of Colorado Institute for Arctic and Alpine Research (INSTAAR), produced this data hypercube derived from the prime data...
SDN Data Center Performance Evaluation of Torus and Hypercube Interconnecting Schemes
Andrus, Bogdan-Mihai; Vegas Olmos, Juan José; Mehmeri, Victor
2015-01-01
— By measuring throughput, delay, loss-rate and jitter, we present how SDN framework yields a 45% performance increase in highly interconnected topologies like torus and hypercube compared to current Layer2 switching technologies, applied to data center architectures......— By measuring throughput, delay, loss-rate and jitter, we present how SDN framework yields a 45% performance increase in highly interconnected topologies like torus and hypercube compared to current Layer2 switching technologies, applied to data center architectures...
On the analytical evaluation of the partition function for unit hypercubes in four dimensions
Hari Dass, N.D.
1984-10-01
The group integrations required for the analytic evaluation of the partition function for unit hypercubes in four dimensions are carried out. Modifications of the graphical rules for SU 2 group integrations cited in the literature are developed for this purpose. A complete classification of all surfaces that can be embedded in the unit hypercube is given and their individual contribution to the partition function worked out. Applications are discussed briefly. (orig.)
Analysis of hepatitis C viral dynamics using Latin hypercube sampling
Pachpute, Gaurav; Chakrabarty, Siddhartha P.
2012-12-01
We consider a mathematical model comprising four coupled ordinary differential equations (ODEs) to study hepatitis C viral dynamics. The model includes the efficacies of a combination therapy of interferon and ribavirin. There are two main objectives of this paper. The first one is to approximate the percentage of cases in which there is a viral clearance in absence of treatment as well as percentage of response to treatment for various efficacy levels. The other is to better understand and identify the parameters that play a key role in the decline of viral load and can be estimated in a clinical setting. A condition for the stability of the uninfected and the infected steady states is presented. A large number of sample points for the model parameters (which are physiologically feasible) are generated using Latin hypercube sampling. An analysis of the simulated values identifies that, approximately 29.85% cases result in clearance of the virus during the early phase of the infection. Results from the χ2 and the Spearman's tests done on the samples, indicate a distinctly different distribution for certain parameters for the cases exhibiting viral clearance under the combination therapy.
Prefetching in file systems for MIMD multiprocessors
Kotz, David F.; Ellis, Carla Schlatter
1990-01-01
The question of whether prefetching blocks on the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions, is considered. Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that (1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, (2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O (input/output) operation, and (3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). The authors explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in the environment.
FTMP (Fault Tolerant Multiprocessor) programmer's manual
Feather, F. E.; Liceaga, C. A.; Padilla, P. A.
1986-01-01
The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.
Operating System for Runtime Reconfigurable Multiprocessor Systems
Diana Göhringer
2011-01-01
Full Text Available Operating systems traditionally handle the task scheduling of one or more application instances on processor-like hardware architectures. RAMPSoC, a novel runtime adaptive multiprocessor System-on-Chip, exploits the dynamic reconfiguration on FPGAs to generate, start and terminate hardware and software tasks. The hardware tasks have to be transferred to the reconfigurable hardware via a configuration access port. The software tasks can be loaded into the local memory of the respective IP core either via the configuration access port or via the on-chip communication infrastructure (e.g. a Network-on-Chip. Recent-series of Xilinx FPGAs, such as Virtex-5, provide two Internal Configuration Access Ports, which cannot be accessed simultaneously. To prevent conflicts, the access to these ports as well as the hardware resource management needs to be controlled, e.g. by a special-purpose operating system running on an embedded processor. For that purpose and to handle the relations between temporally and spatially scheduled operations, the novel approach of an operating system is of high importance. This special purpose operating system, called CAP-OS (Configuration Access Port-Operating System, which will be presented in this paper, supports the clients using the configuration port with the services of priority-based access scheduling, hardware task mapping and resource management.
Performance modeling of parallel algorithms for solving neutron diffusion problems
Azmy, Y.Y.; Kirk, B.L.
1995-01-01
Neutron diffusion calculations are the most common computational methods used in the design, analysis, and operation of nuclear reactors and related activities. Here, mathematical performance models are developed for the parallel algorithm used to solve the neutron diffusion equation on message passing and shared memory multiprocessors represented by the Intel iPSC/860 and the Sequent Balance 8000, respectively. The performance models are validated through several test problems, and these models are used to estimate the performance of each of the two considered architectures in situations typical of practical applications, such as fine meshes and a large number of participating processors. While message passing computers are capable of producing speedup, the parallel efficiency deteriorates rapidly as the number of processors increases. Furthermore, the speedup fails to improve appreciably for massively parallel computers so that only small- to medium-sized message passing multiprocessors offer a reasonable platform for this algorithm. In contrast, the performance model for the shared memory architecture predicts very high efficiency over a wide range of number of processors reasonable for this architecture. Furthermore, the model efficiency of the Sequent remains superior to that of the hypercube if its model parameters are adjusted to make its processors as fast as those of the iPSC/860. It is concluded that shared memory computers are better suited for this parallel algorithm than message passing computers
Best Speed Fit EDF Scheduling for Performance Asymmetric Multiprocessors
Peng Wu
2017-01-01
Full Text Available In order to improve the performance of a real-time system, asymmetric multiprocessors have been proposed. The benefits of improved system performance and reduced power consumption from such architectures cannot be fully exploited unless suitable task scheduling and task allocation approaches are implemented at the operating system level. Unfortunately, most of the previous research on scheduling algorithms for performance asymmetric multiprocessors is focused on task priority assignment. They simply assign the highest priority task to the fastest processor. In this paper, we propose BSF-EDF (best speed fit for earliest deadline first for performance asymmetric multiprocessor scheduling. This approach chooses a suitable processor rather than the fastest one, when allocating tasks. With this proposed BSF-EDF scheduling, we also derive an effective schedulability test.
Real-Time Multiprocessor Programming Language (RTMPL) user's manual
Arpasi, D. J.
1985-01-01
A real-time multiprocessor programming language (RTMPL) has been developed to provide for high-order programming of real-time simulations on systems of distributed computers. RTMPL is a structured, engineering-oriented language. The RTMPL utility supports a variety of multiprocessor configurations and types by generating assembly language programs according to user-specified targeting information. Many programming functions are assumed by the utility (e.g., data transfer and scaling) to reduce the programming chore. This manual describes RTMPL from a user's viewpoint. Source generation, applications, utility operation, and utility output are detailed. An example simulation is generated to illustrate many RTMPL features.
Chip-Multiprocessor Hardware Locks for Safety-Critical Java
Strøm, Torur Biskopstø; Puffitsch, Wolfgang; Schoeberl, Martin
2013-01-01
and may void a task set's schedulability. In this paper we present a hardware locking mechanism to reduce the synchronization overhead. The solution is implemented for the chip-multiprocessor version of the Java Optimized Processor in the context of safety-critical Java. The implementation is compared...
Stream-processing pipelines: processing of streams on multiprocessor architecture
Kavaldjiev, N.K.; Smit, Gerardus Johannes Maria; Jansen, P.G.
In this paper we study the timing aspects of the operation of stream-processing applications that run on a multiprocessor architecture. Dependencies are derived for the processing and communication times of the processors in such a system. Three cases of real-time constrained operation and four
Software for event oriented processing on multiprocessor systems
Fischler, M.; Areti, H.; Biel, J.; Bracker, S.; Case, G.; Gaines, I.; Husby, D.; Nash, T.
1984-08-01
Computing intensive problems that require the processing of numerous essentially independent events are natural customers for large scale multi-microprocessor systems. This paper describes the software required to support users with such problems in a multiprocessor environment. It is based on experience with and development work aimed at processing very large amounts of high energy physics data
Molecular dynamics simulations of short-range force systems on 1024-node hypercubes
Plimpton, S.J.
1990-01-01
In this paper, two parallel algorithms for classical molecular dynamics are presented. The first assigns each processor to a subset of particles; the second assigns each to a fixed region of 3d space. The algorithms are implemented on 1024-node hypercubes for problems characterized by short-range forces, diffusion (so that each particle's neighbors change in time), and problem size ranging from 250 to 10000 particles. Timings for the algorithms on the 1024-node NCUBE/ten and the newer NCUBE 2 hypercubes are given. The latter is found to be competitive with a CRAY-XMP, running an optimized serial algorithm. For smaller problems the NCUBE 2 and CRAY-XMP are roughly the same; for larger ones the NCUBE 2 is up to twice as fast. Parallel efficiencies of the algorithms and communication parameters for the two hypercubes are also examined
Parallel spatial direct numerical simulations on the Intel iPSC/860 hypercube
Joslin, Ronald D.; Zubair, Mohammad
1993-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube is documented. The direct numerical simulation approach is used to compute spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows. The feasibility of using the PSDNS on the hypercube to perform transition studies is examined. The results indicate that the direct numerical simulation approach can effectively be parallelized on a distributed-memory parallel machine. By increasing the number of processors nearly ideal linear speedups are achieved with nonoptimized routines; slower than linear speedups are achieved with optimized (machine dependent library) routines. This slower than linear speedup results because the Fast Fourier Transform (FFT) routine dominates the computational cost and because the routine indicates less than ideal speedups. However with the machine-dependent routines the total computational cost decreases by a factor of 4 to 5 compared with standard FORTRAN routines. The computational cost increases linearly with spanwise wall-normal and streamwise grid refinements. The hypercube with 32 processors was estimated to require approximately twice the amount of Cray supercomputer single processor time to complete a comparable simulation; however it is estimated that a subgrid-scale model which reduces the required number of grid points and becomes a large-eddy simulation (PSLES) would reduce the computational cost and memory requirements by a factor of 10 over the PSDNS. This PSLES implementation would enable transition simulations on the hypercube at a reasonable computational cost.
Correlation inequalities for two-component hypercubic φ4 models. Pt. 2
Soria, J.L.; Instituto Tecnologico de Tijuana
1990-01-01
We continue the program started in the first paper (J. Stat. Phys. 52 (1988) 711-726). We find new and already known correlation inequalities for a family of two-component hypercubic φ 4 models, using techniques of rotated correlation inequalities and random walk representation. (orig.)
Mulder, V.L.; Bruin, de S.; Schaepman, M.E.
2013-01-01
This paper presents a sparse, remote sensing-based sampling approach making use of conditioned Latin Hypercube Sampling (cLHS) to assess variability in soil properties at regional scale. The method optimizes the sampling scheme for a defined spatial population based on selected covariates, which are
A parallel simulated annealing algorithm for standard cell placement on a hypercube computer
Jones, Mark Howard
1987-01-01
A parallel version of a simulated annealing algorithm is presented which is targeted to run on a hypercube computer. A strategy for mapping the cells in a two dimensional area of a chip onto processors in an n-dimensional hypercube is proposed such that both small and large distance moves can be applied. Two types of moves are allowed: cell exchanges and cell displacements. The computation of the cost function in parallel among all the processors in the hypercube is described along with a distributed data structure that needs to be stored in the hypercube to support parallel cost evaluation. A novel tree broadcasting strategy is used extensively in the algorithm for updating cell locations in the parallel environment. Studies on the performance of the algorithm on example industrial circuits show that it is faster and gives better final placement results than the uniprocessor simulated annealing algorithms. An improved uniprocessor algorithm is proposed which is based on the improved results obtained from parallelization of the simulated annealing algorithm.
Scientific applications and numerical algorithms on the midas multiprocessor system
Logan, D.; Maples, C.
1986-01-01
The MIDAS multiprocessor system is a multi-level, hierarchial structure designed at the Advanced Computer Architecture Laboratory of the University of California's Lawrence Berkeley Laboratory. A two-stage, 11-processor system has been operational for over a year and is currently undergoing expansion. It has been employed to investigate the performance of different methods of decomposing various problems and algorithms into a multiprocessor environment. The results of such tests on a variety of applications such as scientific data analysis, Monte Carlo calculations, and image processing, are discussed. Often such decompositions involve investigating the parallel structure of fundamental algorithms. Several basic algorithms dealing with random number generation, matrix diagonalization, fast Fourier transforms, and finite element methods in solving partial differential equations are also discussed. The performance and projected extensibilities of these decompositions on the MIDAS system are reported
Cache-aware network-on-chip for chip multiprocessors
Tatas, Konstantinos; Kyriacou, Costas; Dekoulis, George; Demetriou, Demetris; Avraam, Costas; Christou, Anastasia
2009-05-01
This paper presents the hardware prototype of a Network-on-Chip (NoC) for a chip multiprocessor that provides support for cache coherence, cache prefetching and cache-aware thread scheduling. A NoC with support to these cache related mechanisms can assist in improving systems performance by reducing the cache miss ratio. The presented multi-core system employs the Data-Driven Multithreading (DDM) model of execution. In DDM thread scheduling is done according to data availability, thus the system is aware of the threads to be executed in the near future. This characteristic of the DDM model allows for cache aware thread scheduling and cache prefetching. The NoC prototype is a crossbar switch with output buffering that can support a cache-aware 4-node chip multiprocessor. The prototype is built on the Xilinx ML506 board equipped with a Xilinx Virtex-5 FPGA.
Advanced lectures on multiprocessor programming (1/3)
CERN. Geneva
2011-01-01
Three classes (60 mins) on Multiprocessor Programming Prof. Dr. Christoph von Praun Georg-Simon-Ohm University of Applied Sciences Nuremberg, Germany This is an advanced class on multiprocessor programming. The class gives an introduction to principles of concurrent objects and the notion of different progress guarantees that concurrent computations can have. The focus of this class is on non-blocking computations, i.e. concurrent programs that do not make use of locks. We discuss the implementation of practical non-blocking data structures in detail. 1st class: Introduction to concurrent objects 2nd class: Principles of non-blocking synchronization 3rd class: Concurrent queues Brief Bio of Christoph von Praun Christoph worked on a variety of analysis techniques and runtime platforms for parallel programs. Hist most recent research studies programming models and tools that support transactional synchronization. In prior work, which he also did at the IBM T.J. Watson Research Center in Yorktown Height...
Plasma physics modeling and the Cray-2 multiprocessor
Killeen, J.
1985-01-01
The importance of computer modeling in the magnetic fusion energy research program is discussed. The need for the most advanced supercomputers is described. To meet the demand for more powerful scientific computers to solve larger and more complicated problems, the computer industry is developing multiprocessors. The role of the Cray-2 in plasma physics modeling is discussed with some examples. 28 refs., 2 figs., 1 tab
Multiprocessor Real-Time Scheduling with Hierarchical Processor Affinities
Bonifaci , Vincenzo; Brandenburg , Björn; D'Angelo , Gianlorenzo; Marchetti-Spaccamela , Alberto
2016-01-01
International audience; Many multiprocessor real-time operating systems offer the possibility to restrict the migrations of any task to a specified subset of processors by setting affinity masks. A notion of " strong arbitrary processor affinity scheduling " (strong APA scheduling) has been proposed; this notion avoids schedulability losses due to overly simple implementations of processor affinities. Due to potential overheads, strong APA has not been implemented so far in a real-time operat...
2: Local area networks as a multiprocessor treatment planning system
Neblett, D.L.; Hogan, S.E.
1987-01-01
The creation of a local area network (LAN) of interconnected computers provides an environment of multi computer processors that adds a new dimension to treatment planning. A LAN system provides the opportunity to have two or more computers working on the plan in parallel. With high speed interprocessor transfer, events such as the time consuming task of correcting several individual beams for contours and inhomogeneities can be performed simultaneously; thus, effectively creating a parallel multiprocessor treatment planning system
Design considerations for a multiprocessor based data acquisition system
Tippie, J.W.; Kulaga, J.E.
1979-01-01
The rapid advance of digital technology has provided the systems designer with many new design options. Hardware is no longer the controlling expense. Complex operating systems provide the flexibility and development tools needed by software designers, but restrict throughput. Multiprocessor-based systems can be used to ''front-end'' high-throughput applications while maintaining the many advantages offered by multitasking operating systems. The design of a high-throughput data acquisition system for application in low energy nuclear physics is considered
Multiprocessor Real-Time Locking Protocols for Replicated Resources
2016-07-01
assignment problem, the ac- tual identities of the allocated replicas must be known. When locking protocols are used, tasks may experience delays due to both...Multiprocessor Real-Time Locking Protocols for Replicated Resources ∗ Catherine E. Jarrett1, Kecheng Yang1, Ming Yang1, Pontus Ekberg2, and James H...replicas to execute. In prior work on replicated resources, k-exclusion locks have been used, but this restricts tasks to lock only one replica at a time. To
GOTHIC memory management : a multiprocessor shared single level store
Michel , Béatrice
1990-01-01
Gothic purpose is to build an object-oriented fault-tolerant distributed operating system for a local area network of multiprocessor workstations. This paper describes Gothic memory manager. It realizes the sharing of the secondary memory space between any process running on the Gothic system. Processes on different processors can communicate by sharing permanent information. The manager implements a shared single level storage with an invalidation protocol working on disk-pages to maintain s...
A simple multiprocessor management system for event-parallel computing
Bracker, S.; Gounder, K.; Hendrix, K.; Summers, D.
1996-01-01
Offline software using Transmission Control Protocol/Internet Protocol (TCP/IP) sockets to distribute particle physics events to multiple UNIX/RISC workstations is described. A modular, building block approach was taken that allowed tailoring to solve specific tasks efficiently and simply as they arose. The modest, initial cost was having to learn about sockets for interprocess communication. This multiprocessor management software has been used to control the reconstruction of eight billion raw data events from Fermilab Experiment E791
Control and Reliability of Optical Networks in Multiprocessors
Olsen, James Jonathan
1993-01-01
Optical communication links have great potential to improve the performance of interconnection networks within large parallel multiprocessors, but the problems of semiconductor laser drive control and reliability inhibit their wide use. These problems have been solved in the telecommunications context, but the telecommunications solutions, based on a small number of links, are often too bulky, complex, power-hungry, and expensive to be feasible for use in a multiprocessor network with thousands of optical links. The main problems with the telecommunications approaches are that they are, by definition, designed for long-distance communication and therefore deal with communications links in isolation, instead of in an overall systems context. By taking a system-level approach to solving the laser reliability problem in a multiprocessor, and by exploiting the short -distance nature of the links, one can achieve small, simple, low-power, and inexpensive solutions, practical for implementation in the thousands of optical links that might be used in a multiprocessor. Through modeling and experimentation, I demonstrate that such system-level solutions exist, and are feasible for use in a multiprocessor network. I divide semiconductor laser reliability problems into two classes: transient errors and hard failures, and develop solutions to each type of problem in the context of a large multiprocessor. I find that for transient errors, the computer system would require a very low bit-error-rate (BER), such as 10^{-23}, if no provision were made for error control. Optical links cannot achieve such rates directly, but I find that a much more reasonable link-level BER (such as 10^{-7} ) would be acceptable with simple error detection coding. I then propose a feedback system that will enable lasers to achieve these error levels even when laser threshold current varies. Instead of telecommunications techniques, which require laser output power monitors, I describe a software
Cyclic executive for safety-critical Java on chip-multiprocessors
Ravn, Anders P.; Schoeberl, Martin
2010-01-01
, that uses model checking to find a static schedule, if one exists at all, which gives an implementation of a table driven multiprocessor scheduler. To evaluate the proposed cyclic executive for multiprocessors we have implemented it in the context of safety-critical Java on a Java processor....
Performance of a parallel algorithm for solving the neutron diffusion equation on the hypercube
Kirk, B.L.; Azmy, Y.Y.
1989-01-01
The one-group, steady state neutron diffusion equation in two- dimensional Cartesian geometry is solved using the nodal method technique. By decoupling sets of equations representing the neutron current continuity along the length of rows and columns of computational cells a new iterative algorithm is derived that is more suitable to solving large practical problems. This algorithm is highly parallelizable and is implemented on the Intel iPSC/2 hypercube in three versions which differ essentially in the total size of communicated data. Even though speedup was achieved, the efficiency is very low when many processors are used leading to the conclusion that the hypercube is not as well suited for this algorithm as shared memory machines. 10 refs., 1 fig., 3 tabs
Finite difference time domain solution of electromagnetic scattering on the hypercube
Calalo, R.H.; Lyons, J.R.; Imbriale, W.A.
1988-01-01
Electromagnetic fields interacting with a dielectric or conducting structure produce scattered electromagnetic fields. To model the fields produced by complicated, volumetric structures, the finite difference time domain (FDTD) method employs an iterative solution to Maxwell's time dependent curl equations. Implementations of the FDTD method intensively use memory and perform numerous calculations per time step iteration. The authors have implemented an FDTD code on the California Institute of Technology/Jet Propulsion Laboratory Mark III Hypercube. This code allows to solve problems requiring as many as 2,048,000 unit cells on a 32 node Hypercube. For smaller problems, the code produces solutions in a fraction of the time to solve the same problems on sequential computers
Bootstrapping hypercubic and hypertetrahedral theories in three dimensions arXiv
Stergiou, Andreas
There are three generalizations of the Platonic solids that exist in all dimensions, namely the hypertetrahedron, the hypercube, and the hyperoctahedron, with the latter two being dual. Conformal field theories with the associated symmetry groups as global symmetries can be argued to exist in $d=3$ spacetime dimensions if the $\\varepsilon=4-d$ expansion is valid when $\\varepsilon\\to1$. In this paper hypercubic and hypertetrahedral theories are studied with the non-perturbative numerical conformal bootstrap. In the $N=3$ cubic case it is found that a bound with a kink is saturated by a solution with properties that cannot be reconciled with the $\\varepsilon$ expansion of the cubic theory. Possible implications for cubic magnets and structural phase transitions are discussed. For the hypertetrahedral theory evidence is found that the non-conformal window that is seen with the $\\varepsilon$ expansion exists in $d=3$ as well, and a rough estimate of its extent is given.
Accurate calculation of Green functions on the d-dimensional hypercubic lattice
Loh, Yen Lee
2011-01-01
We write the Green function of the d-dimensional hypercubic lattice in a piecewise form covering the entire real frequency axis. Each piece is a single integral involving modified Bessel functions of the first and second kinds. The smoothness of the integrand allows both real and imaginary parts of the Green function to be computed quickly and accurately for any dimension d and any real frequency, and the computational time scales only linearly with d.
Implementation of a 3-D nonlinear MHD [magnetohydrodynamics] calculation on the Intel hypercube
Lynch, V.E.; Carreras, B.A.; Drake, J.B.; Hicks, H.R.; Lawkins, W.F.
1987-01-01
The optimization of numerical schemes and increasing computer capabilities in the last ten years have improved the efficiency of 3-D nonlinear resistive MHD calculations by about two to three orders of magnitude. However, we are still very limited in performing these types of calculations. Hypercubes have a large number of processors with only local memory and bidirectional links among neighbors. The Intel Hypercube at Oak Ridge has 64 processors with 0.5 megabytes of memory per processor. The multiplicity of processors opens new possibilities for the treatment of such computations. The constraint on time and resources favored the approach of using the existing RSF code which solves as an initial value problem the reduced set of MHD equations for a periodic cylindrical geometry. This code includes minimal physics and geometry, but contains the basic three dimensionality and nonlinear structure of the equations. The code solves the reduced set of MHD equations by Fourier expansion in two angular coordinates and finite differences in the radial one. Due to the continuing interest in these calculations and the likelihood that future supercomputers will take greater advantage of parallelism, the present study was initiated by the ORNL Exploratory Studies Committee and funded entirely by Laboratory Discretionary Funds. The objectives of the study were: to ascertain the suitability of MHD calculation for parallel computation, to design and implement a parallel algorithm to perform the computations, and to evaluate the hypercube, and in particular, ORNL's Intel iPSC, for use in MHD computations
HyperCube: A Small Lensless Position Sensing Device for the Tracking of Flickering Infrared LEDs
Thibaut Raharijaona
2015-07-01
Full Text Available An innovative insect-based visual sensor is designed to perform active marker tracking. Without any optics and a field-of-view of about 60°, a novel miniature visual sensor is able to locate flickering markers (LEDs with an accuracy much greater than the one dictated by the pixel pitch. With a size of only 1 cm3 and a mass of only 0.33 g, the lensless sensor, called HyperCube, is dedicated to 3D motion tracking and fits perfectly with the drastic constraints imposed by micro-aerial vehicles. Only three photosensors are placed on each side of the cubic configuration of the sensing device, making this sensor very inexpensive and light. HyperCube provides the azimuth and elevation of infrared LEDs flickering at a high frequency (>1 kHz with a precision of 0.5°. The minimalistic design in terms of small size, low mass and low power consumption of this visual sensor makes it suitable for many applications in the field of the cooperative flight of unmanned aerial vehicles and, more generally, robotic applications requiring active beacons. Experimental results show that HyperCube provides useful angular measurements that can be used to estimate the relative position between the sensor and the flickering infrared markers.
HyperCube: A Small Lensless Position Sensing Device for the Tracking of Flickering Infrared LEDs.
Raharijaona, Thibaut; Mignon, Paul; Juston, Raphaël; Kerhuel, Lubin; Viollet, Stéphane
2015-07-08
An innovative insect-based visual sensor is designed to perform active marker tracking. Without any optics and a field-of-view of about 60°, a novel miniature visual sensor is able to locate flickering markers (LEDs) with an accuracy much greater than the one dictated by the pixel pitch. With a size of only 1 cm3 and a mass of only 0.33 g, the lensless sensor, called HyperCube, is dedicated to 3D motion tracking and fits perfectly with the drastic constraints imposed by micro-aerial vehicles. Only three photosensors are placed on each side of the cubic configuration of the sensing device, making this sensor very inexpensive and light. HyperCube provides the azimuth and elevation of infrared LEDs flickering at a high frequency (>1 kHz) with a precision of 0.5°. The minimalistic design in terms of small size, low mass and low power consumption of this visual sensor makes it suitable for many applications in the field of the cooperative flight of unmanned aerial vehicles and, more generally, robotic applications requiring active beacons. Experimental results show that HyperCube provides useful angular measurements that can be used to estimate the relative position between the sensor and the flickering infrared markers.
Algorithms for parallel flow solvers on message passing architectures
Vanderwijngaart, Rob F.
1995-01-01
The purpose of this project has been to identify and test suitable technologies for implementation of fluid flow solvers -- possibly coupled with structures and heat equation solvers -- on MIMD parallel computers. In the course of this investigation much attention has been paid to efficient domain decomposition strategies for ADI-type algorithms. Multi-partitioning derives its efficiency from the assignment of several blocks of grid points to each processor in the parallel computer. A coarse-grain parallelism is obtained, and a near-perfect load balance results. In uni-partitioning every processor receives responsibility for exactly one block of grid points instead of several. This necessitates fine-grain pipelined program execution in order to obtain a reasonable load balance. Although fine-grain parallelism is less desirable on many systems, especially high-latency networks of workstations, uni-partition methods are still in wide use in production codes for flow problems. Consequently, it remains important to achieve good efficiency with this technique that has essentially been superseded by multi-partitioning for parallel ADI-type algorithms. Another reason for the concentration on improving the performance of pipeline methods is their applicability in other types of flow solver kernels with stronger implied data dependence. Analytical expressions can be derived for the size of the dynamic load imbalance incurred in traditional pipelines. From these it can be determined what is the optimal first-processor retardation that leads to the shortest total completion time for the pipeline process. Theoretical predictions of pipeline performance with and without optimization match experimental observations on the iPSC/860 very well. Analysis of pipeline performance also highlights the effect of uncareful grid partitioning in flow solvers that employ pipeline algorithms. If grid blocks at boundaries are not at least as large in the wall-normal direction as those immediately adjacent to them, then the first processor in the pipeline will receive a computational load that is less than that of subsequent processors, magnifying the pipeline slowdown effect. Extra compensation is needed for grid boundary effects, even if all grid blocks are equally sized.
Optimizing spread dynamics on graphs by message passing
Altarelli, F; Braunstein, A; Dall’Asta, L; Zecchina, R
2013-01-01
Cascade processes are responsible for many important phenomena in natural and social sciences. Simple models of irreversible dynamics on graphs, in which nodes activate depending on the state of their neighbors, have been successfully applied to describe cascades in a large variety of contexts. Over the past decades, much effort has been devoted to understanding the typical behavior of the cascades arising from initial conditions extracted at random from some given ensemble. However, the problem of optimizing the trajectory of the system, i.e. of identifying appropriate initial conditions to maximize (or minimize) the final number of active nodes, is still considered to be practically intractable, with the only exception being models that satisfy a sort of diminishing returns property called submodularity. Submodular models can be approximately solved by means of greedy strategies, but by definition they lack cooperative characteristics which are fundamental in many real systems. Here we introduce an efficient algorithm based on statistical physics for the optimization of trajectories in cascade processes on graphs. We show that for a wide class of irreversible dynamics, even in the absence of submodularity, the spread optimization problem can be solved efficiently on large networks. Analytic and algorithmic results on random graphs are complemented by the solution of the spread maximization problem on a real-world network (the Epinions consumer reviews network). (paper)
Discrete geometric analysis of message passing algorithm on graphs
Watanabe, Yusuke
2010-04-01
We often encounter probability distributions given as unnormalized products of non-negative functions. The factorization structures are represented by hypergraphs called factor graphs. Such distributions appear in various fields, including statistics, artificial intelligence, statistical physics, error correcting codes, etc. Given such a distribution, computations of marginal distributions and the normalization constant are often required. However, they are computationally intractable because of their computational costs. One successful approximation method is Loopy Belief Propagation (LBP) algorithm. The focus of this thesis is an analysis of the LBP algorithm. If the factor graph is a tree, i.e. having no cycle, the algorithm gives the exact quantities. If the factor graph has cycles, however, the LBP algorithm does not give exact results and possibly exhibits oscillatory and non-convergent behaviors. The thematic question of this thesis is "How the behaviors of the LBP algorithm are affected by the discrete geometry of the factor graph?" The primary contribution of this thesis is the discovery of a formula that establishes the relation between the LBP, the Bethe free energy and the graph zeta function. This formula provides new techniques for analysis of the LBP algorithm, connecting properties of the graph and of the LBP and the Bethe free energy. We demonstrate applications of the techniques to several problems including (non) convexity of the Bethe free energy, the uniqueness and stability of the LBP fixed point. We also discuss the loop series initiated by Chertkov and Chernyak. The loop series is a subgraph expansion of the normalization constant, or partition function, and reflects the graph geometry. We investigate theoretical natures of the series. Moreover, we show a partial connection between the loop series and the graph zeta function.
Optimizing spread dynamics on graphs by message passing
Altarelli, F.; Braunstein, A.; Dall'Asta, L.; Zecchina, R.
2013-09-01
Cascade processes are responsible for many important phenomena in natural and social sciences. Simple models of irreversible dynamics on graphs, in which nodes activate depending on the state of their neighbors, have been successfully applied to describe cascades in a large variety of contexts. Over the past decades, much effort has been devoted to understanding the typical behavior of the cascades arising from initial conditions extracted at random from some given ensemble. However, the problem of optimizing the trajectory of the system, i.e. of identifying appropriate initial conditions to maximize (or minimize) the final number of active nodes, is still considered to be practically intractable, with the only exception being models that satisfy a sort of diminishing returns property called submodularity. Submodular models can be approximately solved by means of greedy strategies, but by definition they lack cooperative characteristics which are fundamental in many real systems. Here we introduce an efficient algorithm based on statistical physics for the optimization of trajectories in cascade processes on graphs. We show that for a wide class of irreversible dynamics, even in the absence of submodularity, the spread optimization problem can be solved efficiently on large networks. Analytic and algorithmic results on random graphs are complemented by the solution of the spread maximization problem on a real-world network (the Epinions consumer reviews network).
The ACP (Advanced Computer Program) multiprocessor system at Fermilab
Nash, T.; Areti, H.; Atac, R.; Biel, J.; Case, G.; Cook, A.; Fischler, M.; Gaines, I.; Hance, R.; Husby, D.
1986-09-01
The Advanced Computer Program at Fermilab has developed a multiprocessor system which is easy to use and uniquely cost effective for many high energy physics problems. The system is based on single board computers which cost under $2000 each to build including 2 Mbytes of on board memory. These standard VME modules each run experiment reconstruction code in Fortran at speeds approaching that of a VAX 11/780. Two versions have been developed: one uses Motorola's 68020 32 bit microprocessor, the other runs with AT and T's 32100. both include the corresponding floating point coprocessor chip. The first system, when fully configured, uses 70 each of the two types of processors. A 53 processor system has been operated for several months with essentially no down time by computer operators in the Fermilab Computer Center, performing at nearly the capacity of 6 CDC Cyber 175 mainframe computers. The VME crates in which the processing ''nodes'' sit are connected via a high speed ''Branch Bus'' to one or more MicroVAX computers which act as hosts handling system resource management and all I/O in offline applications. An interface from Fastbus to the Branch Bus has been developed for online use which has been tested error free at 20 Mbytes/sec for 48 hours. ACP hardware modules are now available commercially. A major package of software, including a simulator that runs on any VAX, has been developed. It allows easy migration of existing programs to this multiprocessor environment. This paper describes the ACP Multiprocessor System and early experience with it at Fermilab and elsewhere.
The ACP [Advanced Computer Program] multiprocessor system at Fermilab
Nash, T.; Areti, H.; Atac, R.
1986-09-01
The Advanced Computer Program at Fermilab has developed a multiprocessor system which is easy to use and uniquely cost effective for many high energy physics problems. The system is based on single board computers which cost under $2000 each to build including 2 Mbytes of on board memory. These standard VME modules each run experiment reconstruction code in Fortran at speeds approaching that of a VAX 11/780. Two versions have been developed: one uses Motorola's 68020 32 bit microprocessor, the other runs with AT and T's 32100. both include the corresponding floating point coprocessor chip. The first system, when fully configured, uses 70 each of the two types of processors. A 53 processor system has been operated for several months with essentially no down time by computer operators in the Fermilab Computer Center, performing at nearly the capacity of 6 CDC Cyber 175 mainframe computers. The VME crates in which the processing ''nodes'' sit are connected via a high speed ''Branch Bus'' to one or more MicroVAX computers which act as hosts handling system resource management and all I/O in offline applications. An interface from Fastbus to the Branch Bus has been developed for online use which has been tested error free at 20 Mbytes/sec for 48 hours. ACP hardware modules are now available commercially. A major package of software, including a simulator that runs on any VAX, has been developed. It allows easy migration of existing programs to this multiprocessor environment. This paper describes the ACP Multiprocessor System and early experience with it at Fermilab and elsewhere
Multiprocessor system with multiple concurrent modes of execution
Ahn, Daniel; Ceze, Luis H; Chen, Dong; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin
2013-12-31
A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.
Software for the ACP [Advanced Computer Program] multiprocessor system
Biel, J.; Areti, H.; Atac, R.
1987-01-01
Software has been developed for use with the Fermilab Advanced Computer Program (ACP) multiprocessor system. The software was designed to make a system of a hundred independent node processors as easy to use as a single, powerful CPU. Subroutines have been developed by which a user's host program can send data to and get results from the program running in each of his ACP node processors. Utility programs make it easy to compile and link host and node programs, to debug a node program on an ACP development system, and to submit a debugged program to an ACP production system
A combined PLC and CPU approach to multiprocessor control
Harris, J.J.; Broesch, J.D.; Coon, R.M.
1995-10-01
A sophisticated multiprocessor control system has been developed for use in the E-Power Supply System Integrated Control (EPSSIC) on the DIII-D tokamak. EPSSIC provides control and interlocks for the ohmic heating coil power supply and its associated systems. Of particular interest is the architecture of this system: both a Programmable Logic Controller (PLC) and a Central Processor Unit (CPU) have been combined on a standard VME bus. The PLC and CPU input and output signals are routed through signal conditioning modules, which provide the necessary voltage and ground isolation. Additionally these modules adapt the signal levels to that of the VME I/O boards. One set of I/O signals is shared between the two processors. The resulting multiprocessor system provides a number of advantages: redundant operation for mission critical situations, flexible communications using conventional TCP/IP protocols, the simplicity of ladder logic programming for the majority of the control code, and an easily maintained and expandable non-proprietary system
Elastic pointer directory organization for scalable shared memory multiprocessors
Yuhang Liu; Mingfa Zhu; Limin Xiao
2014-01-01
In the field of supercomputing, one key issue for scal-able shared-memory multiprocessors is the design of the directory which denotes the sharing state for a cache block. A good direc-tory design intends to achieve three key attributes: reasonable memory overhead, sharer position precision and implementation complexity. However, researchers often face the problem that gain-ing one attribute may result in losing another. The paper proposes an elastic pointer directory (EPD) structure based on the analysis of shared-memory applications, taking the fact that the number of sharers for each directory entry is typical y smal . Analysis re-sults show that for 4 096 nodes, the ratio of memory overhead to the ful-map directory is 2.7%. Theoretical analysis and cycle-accurate execution-driven simulations on a 16 and 64-node cache coherence non uniform memory access (CC-NUMA) multiproces-sor show that the corresponding pointer overflow probability is reduced significantly. The performance is observed to be better than that of a limited pointers directory and almost identical to the ful-map directory, except for the slight implementation complex-ity. Using the directory cache to explore directory access locality is also studied. The experimental result shows that this is a promis-ing approach to be used in the state-of-the-art high performance computing domain.
Thermal-Aware Scheduling for Future Chip Multiprocessors
Pedro Trancoso
2007-04-01
Full Text Available The increased complexity and operating frequency in current single chip microprocessors is resulting in a decrease in the performance improvements. Consequently, major manufacturers offer chip multiprocessor (CMP architectures in order to keep up with the expected performance gains. This architecture is successfully being introduced in many markets including that of the embedded systems. Nevertheless, the integration of several cores onto the same chip may lead to increased heat dissipation and consequently additional costs for cooling, higher power consumption, decrease of the reliability, and thermal-induced performance loss, among others. In this paper, we analyze the evolution of the thermal issues for the future chip multiprocessor architectures and show that as the number of on-chip cores increases, the thermal-induced problems will worsen. In addition, we present several scenarios that result in excessive thermal stress to the CMP chip or significant performance loss. In order to minimize or even eliminate these problems, we propose thermal-aware scheduler (TAS algorithms. When assigning processes to cores, TAS takes their temperature and cooling ability into account in order to avoid thermal stress and at the same time improve the performance. Experimental results have shown that a TAS algorithm that considers also the temperatures of neighboring cores is able to significantly reduce the temperature-induced performance loss while at the same time, decrease the chip's temperature across many different operation and configuration scenarios.
Performances of multiprocessor multidisk architectures for continuous media storage
Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.
1996-03-01
Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
Correlation inequalities for two-component hypercubic /varreverse arrowphi/4 models
Soria, J.L.
1988-01-01
A collection of new and already known correlation inequalities is found for a family of two-component hypercubic /varreverse arrowphi/ 4 models, using techniques of duplicated variables, rotated correlation inequalities, and random walk representation. Among the interesting new inequalities are: rotated very special Dunlop-Newman inequality 2 ; /varreverse arrowphi//sub 1z/ 2 + /varreverse arrowphi//sub 2z/ 2 ≥ 0, rotated Griffiths I inequality 2 - /varreverse arrowphi//sub 2z/ 2 > ≥ 0, and anti-Lebowitz inequality u 4 1111 ≥ 0
Asymptotics of the spectral gap for the interchange process on large hypercubes
Starr, Shannon; Conomos, Matthew P
2011-01-01
We consider the interchange process (IP) on the d-dimensional, discrete hypercube of side-length n. Specifically, we compare the spectral gap of the IP to the spectral gap of the random walk (RW) on the same graph. We prove that the two spectral gaps are asymptotically equivalent, in the limit n→∞. This result gives further supporting evidence for a conjecture of Aldous, that the spectral gap of the IP equals the spectral gap of the RW on all finite graphs. Our proof is based on an argument invented by Handjani and Jungreis, who proved Aldous's conjecture for all trees
System-Level Design Methodologies for Networked Multiprocessor Systems-on-Chip
Virk, Kashif Munir
2008-01-01
is the first such attempt in the published literature. The second part of the thesis deals with the issues related to the development of system-level design methodologies for networked multiprocessor systems-on-chip at various levels of design abstraction with special focus on the modeling and design...... at the system-level. The multiprocessor modeling framework is then extended to include models of networked multiprocessor systems-on-chip which is then employed to model wireless sensor networks both at the sensor node level as well as the wireless network level. In the third and the final part, the thesis...... to the transaction-level model. The thesis, as a whole makes contributions by describing a design methodology for networked multiprocessor embedded systems at three layers of abstraction from system-level through transaction-level to the cycle accurate level as well as demonstrating it practically by implementing...
Operating system for a real-time multiprocessor propulsion system simulator. User's manual
Cole, G. L.
1985-01-01
The NASA Lewis Research Center is developing and evaluating experimental hardware and software systems to help meet future needs for real-time, high-fidelity simulations of air-breathing propulsion systems. Specifically, the real-time multiprocessor simulator project focuses on the use of multiple microprocessors to achieve the required computing speed and accuracy at relatively low cost. Operating systems for such hardware configurations are generally not available. A real time multiprocessor operating system (RTMPOS) that supports a variety of multiprocessor configurations was developed at Lewis. With some modification, RTMPOS can also support various microprocessors. RTMPOS, by means of menus and prompts, provides the user with a versatile, user-friendly environment for interactively loading, running, and obtaining results from a multiprocessor-based simulator. The menu functions are described and an example simulation session is included to demonstrate the steps required to go from the simulation loading phase to the execution phase.
Standard interfaces for program-modular multiprocessor systems
Chernykh, E.V.
1982-01-01
The peculiarities of the structures of existing and developed standard interfaces used in automation systems for nuclear physical experiments are considered. general structural characteristics of multiprocessor system interfaces are revealed. The comparison of the existing system CAMAC crate and designed standards of COMPEX, E3S and FASTBUS interfaces by capacity and relative cost is carried out. The analysis of the given data shows that operation of any interface is more advantageous at the rates close to capacity values, the relative cost being minimum. In this case the advantage is on the side of interfaces with greater capacity values for which at a moderated decrease of the exchange or requests processing rate the relative costs grow slower. A higher capacity of one-cycle exchange is provided with functional data way specialization in the interface. The conclusion is drawn that most perspective trend in the development of automation systems for high energy physics experiments is using FASTBUS standard
An Adaptive Hybrid Multiprocessor technique for bioinformatics sequence alignment
Bonny, Talal
2012-07-28
Sequence alignment algorithms such as the Smith-Waterman algorithm are among the most important applications in the development of bioinformatics. Sequence alignment algorithms must process large amounts of data which may take a long time. Here, we introduce our Adaptive Hybrid Multiprocessor technique to accelerate the implementation of the Smith-Waterman algorithm. Our technique utilizes both the graphics processing unit (GPU) and the central processing unit (CPU). It adapts to the implementation according to the number of CPUs given as input by efficiently distributing the workload between the processing units. Using existing resources (GPU and CPU) in an efficient way is a novel approach. The peak performance achieved for the platforms GPU + CPU, GPU + 2CPUs, and GPU + 3CPUs is 10.4 GCUPS, 13.7 GCUPS, and 18.6 GCUPS, respectively (with the query length of 511 amino acid). © 2010 IEEE.
Multi-processor network implementations in Multibus II and VME
Briegel, C.
1992-01-01
ACNET (Fermilab Accelerator Controls Network), a proprietary network protocol, is implemented in a multi-processor configuration for both Multibus II and VME. The implementations are contrasted by the bus protocol and software design goals. The Multibus II implementation provides for multiple processors running a duplicate set of tasks on each processor. For a network connected task, messages are distributed by a network round-robin scheduler. Further, messages can be stopped, continued, or re-routed for each task by user-callable commands. The VME implementation provides for multiple processors running one task across all processors. The process can either be fixed to a particular processor or dynamically allocated to an available processor depending on the scheduling algorithm of the multi-processing operating system. (author)
A measurement-based performability model for a multiprocessor system
Ilsueh, M. C.; Iyer, Ravi K.; Trivedi, K. S.
1987-01-01
A measurement-based performability model based on real error-data collected on a multiprocessor system is described. Model development from the raw errror-data to the estimation of cumulative reward is described. Both normal and failure behavior of the system are characterized. The measured data show that the holding times in key operational and failure states are not simple exponential and that semi-Markov process is necessary to model the system behavior. A reward function, based on the service rate and the error rate in each state, is then defined in order to estimate the performability of the system and to depict the cost of different failure types and recovery procedures.
Analysis and Optimisation of Hierarchically Scheduled Multiprocessor Embedded Systems
Pop, Traian; Pop, Paul; Eles, Petru
2008-01-01
We present an approach to the analysis and optimisation of heterogeneous multiprocessor embedded systems. The systems are heterogeneous not only in terms of hardware components, but also in terms of communication protocols and scheduling policies. When several scheduling policies share a resource......, they are organised in a hierarchy. In this paper, we first develop a holistic scheduling and schedulability analysis that determines the timing properties of a hierarchically scheduled system. Second, we address design problems that are characteristic to such hierarchically scheduled systems: assignment...... of scheduling policies to tasks, mapping of tasks to hardware components, and the scheduling of the activities. We also present several algorithms for solving these problems. Our heuristics are able to find schedulable implementations under limited resources, achieving an efficient utilisation of the system...
Geometric Algorithms for Private-Cache Chip Multiprocessors
Ajwani, Deepak; Sitchinava, Nodari; Zeh, Norbert
2010-01-01
-D convex hulls. These results are obtained by analyzing adaptations of either the PEM merge sort algorithm or PRAM algorithms. For the second group of problems—orthogonal line segment intersection reporting, batched range reporting, and related problems—more effort is required. What distinguishes......We study techniques for obtaining efficient algorithms for geometric problems on private-cache chip multiprocessors. We show how to obtain optimal algorithms for interval stabbing counting, 1-D range counting, weighted 2-D dominance counting, and for computing 3-D maxima, 2-D lower envelopes, and 2...... these problems from the ones in the previous group is the variable output size, which requires I/O-efficient load balancing strategies based on the contribution of the individual input elements to the output size. To obtain nearly optimal algorithms for these problems, we introduce a parallel distribution...
Using a commercial symmetric multiprocessor for lattice QCD
Brower, R.C.; Chen, D.; Negele, J.W.
1998-01-01
In its evolution, the computer industry has reached the point when considerable computing power can be packaged on a single microprocessor chip. At the same time, costs of designing a computer system around such a CPU are growing. For these reasons we decided to explore a possibility of using commercially available symmetric multiprocessors (SMP) as building blocks for the LQCD computer. Careful analysis of the architecture allowed us to build a QCD primitive library running close to the peak performance on the UltraSPARC processor. As a result, multithreaded QCD code (both the heatbath and the Wilson fermion inverter) runs at about 50% efficiency on a single SMP. The communication between different CPUs is handled by a coherent memory system. Currently we are planning to connect several SMPs with a high bandwidth network into a single system. (orig.)
Utilizing a multiprocessor architecture - The performance of MIDAS
Maples, C.; Logan, D.; Meng, J.; Rathbun, W.; Weaver, D.
1983-01-01
The MIDAS architecture organizes multiple CPUs into clusters called distributed subsystems. Each subsystem consists of an array of processors controlled by a supervisory CPU. The multiprocessor array is composed of commercial CPUs (with floating point hardware) and specialized processing elements. Interprocessor communication within the array may occur either through switched memory modules or common shared memory. The architecture permits multiple processors to be focused on single problems. A distributed subsystem has been constructed and tested. It currently consists of a supervisor CPU; 16 blocks of independently switchable memory; 9 general purpose, VAX-class CPUs; and 2 specialized pipelined processors to handle I/O. Results on a variety of problems indicate that the subsystem performs 8 to 15 times faster than a standard computer with an identical CPU. The difference in performance represents the effect of differing CPU and I/O requirements
Hardware support for CSP on a Java chip multiprocessor
Gruian, Flavius; Schoeberl, Martin
2013-01-01
Due to memory bandwidth limitations, chip multiprocessors (CMPs) adopting the convenient shared memory model for their main memory architecture scale poorly. On-chip core-to-core communication is a solution to this problem, that can lead to further performance increase for a number of multithreaded...... applications. Programmatically, the Communicating Sequential Processes (CSPs) paradigm provides a sound computational model for such an architecture with message based communication. In this paper we explore hardware support for CSP in the context of an embedded Java CMP. The hardware support for CSP are on......-chip communication channels, implemented by a ring-based network-on-chip (NoC), to reduce the memory bandwidth pressure on the shared memory.The presented solution is scalable and also specific for our limited resources and real-time predictability requirements. CMP architectures of three to eight processors were...
Recommending the heterogeneous cluster type multi-processor system computing
Iijima, Nobukazu
2010-01-01
Real-time reactor simulator had been developed by reusing the equipment of the Musashi reactor and its performance improvement became indispensable for research tools to increase sampling rate with introduction of arithmetic units using multi-Digital Signal Processor(DSP) system (cluster). In order to realize the heterogeneous cluster type multi-processor system computing, combination of two kinds of Control Processor (CP) s, Cluster Control Processor (CCP) and System Control Processor (SCP), were proposed with Large System Control Processor (LSCP) for hierarchical cluster if needed. Faster computing performance of this system was well evaluated by simulation results for simultaneous execution of plural jobs and also pipeline processing between clusters, which showed the system led to effective use of existing system and enhancement of the cost performance. (T. Tanaka)
Behavioral Simulation and Performance Evaluation of Multi-Processor Architectures
Ausif Mahmood
1996-01-01
Full Text Available The development of multi-processor architectures requires extensive behavioral simulations to verify the correctness of design and to evaluate its performance. A high level language can provide maximum flexibility in this respect if the constructs for handling concurrent processes and a time mapping mechanism are added. This paper describes a novel technique for emulating hardware processes involved in a parallel architecture such that an object-oriented description of the design is maintained. The communication and synchronization between hardware processes is handled by splitting the processes into their equivalent subprograms at the entry points. The proper scheduling of these subprograms is coordinated by a timing wheel which provides a time mapping mechanism. Finally, a high level language pre-processor is proposed so that the timing wheel and the process emulation details can be made transparent to the user.
Performance of Multithreaded Chip Multiprocessors And Implications for Operating System Design
Fedorova, Alexandra; Seltzer, Margo I.; Small, Christopher A.; Nussbaum, Daniel
2005-01-01
An operating system’s design is often influenced by the architecture of the target hardware. While uniprocessor and multiprocessor architectures are well understood, such is not the case for multithreaded chip multiprocessors (CMT) – a new generation of processors designed to improve performance of memory-intensive applications. The first systems equipped with CMT processors are just becoming available, so it is critical that we now understand how to obtain the best performance from such syst...
Runtime adaptive multi-processor system-on-chip: RAMPSoC
Göhringer, D.; Hübner, M.; Schatz, V.; Becker, J.
2008-01-01
Current trends in high performance computing show, that the usage of multiprocessor systems on chip are one approach for the requirements of computing intensive applications. The multiprocessor system on chip (MPSoC) approaches often provide a static and homogeneous infrastructure of networked microprocessor on the chip die. A novel idea in this research area is to introduce the dynamic adaptivity of reconfigurable hardware in order to provide a flexible heterogeneous set of processing elemen...
Safety-critical Java with cyclic executives on chip-multiprocessors
Ravn, Anders P.; Schoeberl, Martin
2012-01-01
Chip-multiprocessors offer increased processing power at a low cost. However, in order to use them for real-time systems, tasks have to be scheduled efficiently and predictably. It is well known that finding optimal schedules is a computationally hard problem. In this paper we present a solution ...... for multiprocessors, we have implemented it in the context of safety-critical Java on a Java processor....
Energies and radial distributions of Bs mesons - the effect of hypercubic blocking
Koponen, Jonna
2006-12-01
This is a follow-up to our earlier work for the energies and the charge (vector) and matter (scalar) distributions for S-wave states in a heavy-light meson, where the heavy quark is static and the light quark has a mass about that of the strange quark. We study the radial distributions of higher angular momentum states, namely P- and D-wave states, using a "fuzzy" static quark. A new improvement is the use of hypercubic blocking in the time direction, which effectively constrains the heavy quark to move within a 2a hypercube (a is the lattice spacing). The calculation is carried out with dynamical fermions on a 163 × 32 lattice with a ≈ 0.10 fm generated using the non-perturbatively improved clover action. The configurations were gener- ated by the UKQCD Collaboration using lattice action parameters β = 5.2, c SW = 2.0171 and κ = 0.1350. In nature the closest equivalent of this heavy-light system is the Bs meson. Attempts are now being made to understand these results in terms of the Dirac equation.
Application of Latin hypercube sampling to RADTRAN 4 truck accident risk sensitivity analysis
Mills, G.S.; Neuhauser, K.S.; Kanipe, F.L.
1994-01-01
The sensitivity of calculated dose estimates to various RADTRAN 4 inputs is an available output for incident-free analysis because the defining equations are linear and sensitivity to each variable can be calculated in closed mathematical form. However, the necessary linearity is not characteristic of the equations used in calculation of accident dose risk, making a similar tabulation of sensitivity for RADTRAN 4 accident analysis impossible. Therefore, a study of sensitivity of accident risk results to variation of input parameters was performed using representative routes, isotopic inventories, and packagings. It was determined that, of the approximately two dozen RADTRAN 4 input parameters pertinent to accident analysis, only a subset of five or six has significant influence on typical analyses or is subject to random uncertainties. These five or six variables were selected as candidates for Latin Hypercube Sampling applications. To make the effect of input uncertainties on calculated accident risk more explicit, distributions and limits were determined for two variables which had approximately proportional effects on calculated doses: Pasquill Category probability (PSPROB) and link population density (LPOPD). These distributions and limits were used as input parameters to Sandia's Latin Hypercube Sampling code to generate 50 sets of RADTRAN 4 input parameters used together with point estimates of other necessary inputs to calculate 50 observations of estimated accident dose risk.Tabulations of the RADTRAN 4 accident risk input variables and their influence on output plus illustrative examples of the LHS calculations, for truck transport situations that are typical of past experience, will be presented
Combining Latin Hypercube Designs and Discrete Event Simulation in a Study of a Surgical Unit
Dehlendorff, Christian; Andersen, Klaus Kaae; Kulahci, Murat
Summary form given only:In this article experiments on a discrete event simulation model for an orthopedic surgery are considered. The model is developed as part of a larger project in co-operation with Copenhagen University Hospital in Gentofte. Experiments on the model are performed by using...... Latin hypercube designs. The parameter set consists of system settings such as use of preparation room for sedation and the number of operating rooms, as well as management decisions such as staffing, size of the recovery room and the number of simultaneously active operating rooms. Sensitivity analysis...... and optimization combined with meta-modeling are employed in search for optimal setups. The primary objective in this article is to minimize time spent by the patients in the system. The overall long-term objective for the orthopedic surgery unit is to minimize time lost during the pre- and post operation...
Baron, Jorge H.; Nunez Mac Leod, J.E.
2000-01-01
The present paper deals with the utilization of advanced sampling statistical methods to perform uncertainty and sensitivity analysis on numerical models. Such models may represent physical phenomena, logical structures (such as boolean expressions) or other systems, and various of their intrinsic parameters and/or input variables are usually treated as random variables simultaneously. In the present paper a simple method to scale-up Latin Hypercube Sampling (LHS) samples is presented, starting with a small sample and duplicating its size at each step, making it possible to use the already run numerical model results with the smaller sample. The method does not distort the statistical properties of the random variables and does not add any bias to the samples. The results is a significant reduction in numerical models running time can be achieved (by re-using the previously run samples), keeping all the advantages of LHS, until an acceptable representation level is achieved in the output variables. (author)
Intelligent discrete particle swarm optimization for multiprocessor task scheduling problem
S Sarathambekai
2017-03-01
Full Text Available Discrete particle swarm optimization is one of the most recently developed population-based meta-heuristic optimization algorithm in swarm intelligence that can be used in any discrete optimization problems. This article presents a discrete particle swarm optimization algorithm to efficiently schedule the tasks in the heterogeneous multiprocessor systems. All the optimization algorithms share a common algorithmic step, namely population initialization. It plays a significant role because it can affect the convergence speed and also the quality of the final solution. The random initialization is the most commonly used method in majority of the evolutionary algorithms to generate solutions in the initial population. The initial good quality solutions can facilitate the algorithm to locate the optimal solution or else it may prevent the algorithm from finding the optimal solution. Intelligence should be incorporated to generate the initial population in order to avoid the premature convergence. This article presents a discrete particle swarm optimization algorithm, which incorporates opposition-based technique to generate initial population and greedy algorithm to balance the load of the processors. Make span, flow time, and reliability cost are three different measures used to evaluate the efficiency of the proposed discrete particle swarm optimization algorithm for scheduling independent tasks in distributed systems. Computational simulations are done based on a set of benchmark instances to assess the performance of the proposed algorithm.
On a Multiprocessor Computer Farm for Online Physics Data Processing
Sinanis, N J
1999-01-01
The topic of this thesis is the design-phase performance evaluation of a large multiprocessor (MP) computer farm intended for the on-line data processing of the Compact Muon Solenoid (CMS) experiment. CMS is a high energy Physics experiment, planned to operate at CERN (Geneva, Switzerland) during the year 2005. The CMS computer farm is consisting of 1,000 MP computer systems and a 1,000 X 1,000 communications switch. The followed approach to the farm performance evaluation is through simulation studies and evaluation of small prototype systems building blocks of the farm. For the purposes of the simulation studies, we have developed a discrete-event, event-driven simulator that is capable to describe the high-level architecture of the farm and give estimates of the farm's performance. The simulator is designed in a modular way to facilitate the development of various modules that model the behavior of the farm building blocks in the desired level of detail. With the aid of this simulator, we make a particular...
Numerical fluid flow and heat transfer calculations on multiprocessor systems
Oehman, G.A.; Malen, T.E.; Kuusela, P.
1989-01-01
The first part of the report presents the basic principles of parallel processing, and factors influencing tbe efficiency of practical applications are discussed. In a multiprocessor computer, different parts of the program code are executed in parallel, i.e. simultaneous with respect to time, on different processors, and thus it becomes possible to decrease the overall computation time by a factor, which in the ideal case is equal to the number of processors. The application study starts from the numerical solution of the twodimesional Laplace equation, which describes the steady heat conduction in a solid plate and advances through the solution of the three dimensional Laplace equation to the case of study laminar fluid flow in a twodimensional box at Reynolds numbers up to 20. Hereby the stream function-vorticity method is first applied and the SIMPLER method. The conventional (sequential) numerical algoritms for these fluid flow and heat transfer problems are found not to be ideally suited for conversion to parallel computation, but sped-up ratios considerably above 50 % of the theoretical maximum are regularly achieved in the runs. The numerical procedures we coded in the OCCAM-2 language and the test runs were performed at who Akademi on the imperimental HATHI-computers containing 16 T4l4 and 100 INMOS T800 transputers respectively.
Numerical fluid flow and heat transfer calculations on multiprocessor systems
Oehman, G.A.; Malen, T.E.; Kuusela, P.
1989-12-31
The first part of the report presents the basic principles of parallel processing, and factors influencing tbe efficiency of practical applications are discussed. In a multiprocessor computer, different parts of the program code are executed in parallel, i.e. simultaneous with respect to time, on different processors, and thus it becomes possible to decrease the overall computation time by a factor, which in the ideal case is equal to the number of processors. The application study starts from the numerical solution of the twodimesional Laplace equation, which describes the steady heat conduction in a solid plate and advances through the solution of the three dimensional Laplace equation to the case of study laminar fluid flow in a twodimensional box at Reynolds numbers up to 20. Hereby the stream function-vorticity method is first applied and the SIMPLER method. The conventional (sequential) numerical algoritms for these fluid flow and heat transfer problems are found not to be ideally suited for conversion to parallel computation, but sped-up ratios considerably above 50 % of the theoretical maximum are regularly achieved in the runs. The numerical procedures we coded in the OCCAM-2 language and the test runs were performed at who Akademi on the imperimental HATHI-computers containing 16 T4l4 and 100 INMOS T800 transputers respectively.
USC orthogonal multiprocessor for image processing with neural networks
Hwang, Kai; Panda, Dhabaleswar K.; Haddadi, Navid
1990-07-01
This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.
Supporting Multiprocessors in the Icecap Safety-Critical Java Run-Time Environment
Zhao, Shuai; Wellings, Andy; Korsholm, Stephan Erbs
The current version of the Safety Critical Java (SCJ) specification defines three compliance levels. Level 0 targets single processor programs while Level 1 and 2 can support multiprocessor platforms. Level 1 programs must be fully partitioned but Level 2 programs can also be more globally...... scheduled. As of yet, there is no official Reference Implementation for SCJ. However, the icecap project has produced a Safety-Critical Java Run-time Environment based on the Hardware-near Virtual Machine (HVM). This supports SCJ at all compliance levels and provides an implementation of the safety......-critical Java (javax.safetycritical) package. This is still work-in-progress and lacks certain key features. Among these is the ability to support multiprocessor platforms. In this paper, we explore two possible options to adding multiprocessor support to this environment: the “green thread” and the “native...
A general model for memory interference in a multiprocessor system with memory hierarchy
Taha, Badie A.; Standley, Hilda M.
1989-01-01
The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.
Hard Real-Time Performances in Multiprocessor-Embedded Systems Using ASMP-Linux
Daniel Pierre Bovet
2008-01-01
Full Text Available Multiprocessor systems, especially those based on multicore or multithreaded processors, and new operating system architectures can satisfy the ever increasing computational requirements of embedded systems. ASMP-LINUX is a modified, high responsiveness, open-source hard real-time operating system for multiprocessor systems capable of providing high real-time performance while maintaining the code simple and not impacting on the performances of the rest of the system. Moreover, ASMP-LINUX does not require code changing or application recompiling/relinking. In order to assess the performances of ASMP-LINUX, benchmarks have been performed on several hardware platforms and configurations.
Hard Real-Time Performances in Multiprocessor-Embedded Systems Using ASMP-Linux
Betti Emiliano
2008-01-01
Full Text Available Abstract Multiprocessor systems, especially those based on multicore or multithreaded processors, and new operating system architectures can satisfy the ever increasing computational requirements of embedded systems. ASMP-LINUX is a modified, high responsiveness, open-source hard real-time operating system for multiprocessor systems capable of providing high real-time performance while maintaining the code simple and not impacting on the performances of the rest of the system. Moreover, ASMP-LINUX does not require code changing or application recompiling/relinking. In order to assess the performances of ASMP-LINUX, benchmarks have been performed on several hardware platforms and configurations.
Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.
2014-03-01
The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
A system-level multiprocessor system-on-chip modeling framework
Virk, Kashif Munir; Madsen, Jan
2004-01-01
We present a system-level modeling framework to model system-on-chips (SoC) consisting of heterogeneous multiprocessors and network-on-chip communication structures in order to enable the developers of today's SoC designs to take advantage of the flexibility and scalability of network-on-chip and...... SoC design. We show how a hand-held multimedia terminal, consisting of JPEG, MP3 and GSM applications, can be modeled as a multiprocessor SoC in our framework....
Silvio Picano
1992-01-01
Full Text Available We present detailed experimental work involving a commercially available large scale shared memory multiple instruction stream-multiple data stream (MIMD parallel computer having a software controlled cache coherence mechanism. To make effective use of such an architecture, the programmer is responsible for designing the program's structure to match the underlying multiprocessors capabilities. We describe the techniques used to exploit our multiprocessor (the BBN TC2000 on a network simulation program, showing the resulting performance gains and the associated programming costs. We show that an efficient implementation relies heavily on the user's ability to explicitly manage the memory system.
Mapping of H.264 decoding on a multiprocessor architecture
van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.
2003-05-01
Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result from this, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs, that are also scalable over the advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of e.g. a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling e.g. SD resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system, executing the same software. Experimental results show that the data communication considerably reduces up to 65% directly improving the overall performance. Apart from considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the
Nadkarni, P M; Miller, P L
1991-01-01
A parallel program for inter-database sequence comparison was developed on the Intel Hypercube using two models of parallel programming. One version was built using machine-specific Hypercube parallel programming commands. The other version was built using Linda, a machine-independent parallel programming language. The two versions of the program provide a case study comparing these two approaches to parallelization in an important biological application area. Benchmark tests with both programs gave comparable results with a small number of processors. As the number of processors was increased, the Linda version was somewhat less efficient. The Linda version was also run without change on Network Linda, a virtual parallel machine running on a network of desktop workstations.
Kominsky, P.J.
1990-01-01
A production 3-D Beam and Warming implicit Navier-Stokes code has been implemented on the NCUBE hypercube using the grid allocation scheme of Bruno and Capello. Predicted (32-bit) performance on 1024 nodes is 67.1 MFLOPS. Efficiencies of 70% are attainable for implicit algorithms, although constant-memory scaled performance is found to decrease with increasing number of nodes, unlike explicit implementations
Subsurface stormflow modeling with sensitivity analysis using a Latin-hypercube sampling technique
Gwo, J.P.; Toran, L.E.; Morris, M.D.; Wilson, G.V.
1994-09-01
Subsurface stormflow, because of its dynamic and nonlinear features, has been a very challenging process in both field experiments and modeling studies. The disposal of wastes in subsurface stormflow and vadose zones at Oak Ridge National Laboratory, however, demands more effort to characterize these flow zones and to study their dynamic flow processes. Field data and modeling studies for these flow zones are relatively scarce, and the effect of engineering designs on the flow processes is poorly understood. On the basis of a risk assessment framework and a conceptual model for the Oak Ridge Reservation area, numerical models of a proposed waste disposal site were built, and a Latin-hypercube simulation technique was used to study the uncertainty of model parameters. Four scenarios, with three engineering designs, were simulated, and the effectiveness of the engineering designs was evaluated. Sensitivity analysis of model parameters suggested that hydraulic conductivity was the most influential parameter. However, local heterogeneities may alter flow patterns and result in complex recharge and discharge patterns. Hydraulic conductivity, therefore, may not be used as the only reference for subsurface flow monitoring and engineering operations. Neither of the two engineering designs, capping and French drains, was found to be effective in hydrologically isolating downslope waste trenches. However, pressure head contours indicated that combinations of both designs may prove more effective than either one alone
Harper, W.V.; Gupta, S.K.
1983-10-01
A computer code was used to study steady-state flow for a hypothetical borehole scenario. The model consists of three coupled equations with only eight parameters and three dependent variables. This study focused on steady-state flow as the performance measure of interest. Two different approaches to sensitivity/uncertainty analysis were used on this code. One approach, based on Latin Hypercube Sampling (LHS), is a statistical sampling method, whereas, the second approach is based on the deterministic evaluation of sensitivities. The LHS technique is easy to apply and should work well for codes with a moderate number of parameters. Of deterministic techniques, the direct method is preferred when there are many performance measures of interest and a moderate number of parameters. The adjoint method is recommended when there are a limited number of performance measures and an unlimited number of parameters. This unlimited number of parameters capability can be extremely useful for finite element or finite difference codes with a large number of grid blocks. The Office of Nuclear Waste Isolation will use the technique most appropriate for an individual situation. For example, the adjoint method may be used to reduce the scope to a size that can be readily handled by a technique such as LHS. Other techniques for sensitivity/uncertainty analysis, e.g., kriging followed by conditional simulation, will be used also. 15 references, 4 figures, 9 tables
Generalized hypercube graph $\\Q_n(S$, graph products and self-orthogonal codes
Pani Seneviratne
2016-01-01
Full Text Available A generalized hypercube graph $\\Q_n(S$ has $\\F_{2}^{n}=\\{0,1\\}^n$ as the vertex set and two vertices being adjacent whenever their mutual Hamming distance belongs to $S$, where $n \\ge 1$ and $S\\subseteq \\{1,2,\\ldots, n\\}$. The graph $\\Q_n(\\{1\\}$ is the $n$-cube, usually denoted by $\\Q_n$.We study graph boolean products $G_1 = \\Q_n(S\\times \\Q_1, G_2 = \\Q_{n}(S\\wedge \\Q_1$, $G_3 = \\Q_{n}(S[\\Q_1]$ and show that binary codes from neighborhood designs of $G_1, G_2$ and $G_3$ are self-orthogonal for all choices of $n$ and $S$. More over, we show that the class of codes $C_1$ are self-dual. Further we find subgroups of the automorphism group of these graphs and use these subgroups to obtain PD-sets for permutation decoding. As an example we find a full error-correcting PD set for the binary $[32, 16, 8]$ extremal self-dual code.
A scalable single-chip multi-processor architecture with on-chip RTOS kernel
Theelen, B.D.; Verschueren, A.C.; Reyes Suarez, V.V.; Stevens, M.P.J.; Nunez, A.
2003-01-01
Now that system-on-chip technology is emerging, single-chip multi-processors are becoming feasible. A key problem of designing such systems is the complexity of their on-chip interconnects and memory architecture. It is furthermore unclear at what level software should be integrated. An example of a
Temporal analysis and scheduling of hard real-time radios running on a multi-processor
Moreira, O.
2012-01-01
On a multi-radio baseband system, multiple independent transceivers must share the resources of a multi-processor, while meeting each its own hard real-time requirements. Not all possible combinations of transceivers are known at compile time, so a solution must be found that either allows for
DAEDALUS: System-Level Design Methodology for Streaming Multiprocessor Embedded Systems on Chips
Stefanov, T.; Pimentel, A.; Nikolov, H.; Ha, S.; Teich, J.
2017-01-01
The complexity of modern embedded systems, which are increasingly based on heterogeneous multiprocessor system-on-chip (MPSoC) architectures, has led to the emergence of system-level design. To cope with this design complexity, system-level design aims at raising the abstraction level of the design
Design of massively parallel hardware multi-processors for highly-demanding embedded applications
Jozwiak, L.; Jan, Y.
2013-01-01
Many new embedded applications require complex computations to be performed to tight schedules, while at the same time demanding low energy consumption and low cost. For implementation of these highly-demanding applications, highly-optimized application-specific multi-processor system-on-a-chip
Abstractions for aperiodic multiprocessor scheduling of real-time stream processing applications
Hausmans, J.P.H.M.
2015-01-01
Embedded multiprocessor systems are often used in the domain of real-time stream processing applications to keep up with increasing power and performance requirements. Examples of such real-time stream processing applications are digital radio baseband processing and WLAN transceivers. These stream
Shared random access memory resource for multiprocessor real-time systems
Dimmler, D.G.; Hardy, W.H. II
1977-01-01
A shared random-access memory resource is described which is used within real-time data acquisition and control systems with multiprocessor and multibus organizations. Hardware and software aspects are discussed in a specific example where interconnections are done via a UNIBUS. The general applicability of the approach is also discussed
Distributed power management of real-time applications on a GALS multiprocessor SOC
Nelson, Andrew; Goossens, Kees
2015-01-01
It is generally desirable to reduce the power consumption of embedded systems. Dynamic Voltage and Frequency Scaling (DVFS) is a commonly applied technique to achieve power reduction at the cost of computational performance. Multiprocessor System on Chips (MPSoCs) can have multiple voltage and
Cache aware mapping of streaming apllications on a multiprocessor system-on-chip
Moonen, A.J.M.; Bekooij, M.J.G.; Berg, van den R.M.J.; Meerbergen, van J.; Sciuto, D.; Peng, Z.
2008-01-01
Efficient use of the memory hierarchy is critical for achieving high performance in a multiprocessor system- on-chip. An external memory that is shared between processors is a bottleneck in current and future systems. Cache misses and a large cache miss penalty contribute to a low processor
3D-TV Rendering on a Multiprocessor System on a Chip
Van Eijndhoven, J.T.J.; Li, X.
2006-01-01
This thesis focuses on the issue of mapping 3D-TV rendering applications to a multiprocessor platform. The target platform aims to address tomorrow's multi-media consumer market. The prototype chip, called Wasabi, contains a set of TriMedia processors that communicate viaa shared memory, fast
An FPGA design flow for reconfigurable network-based multi-processor systems on chip
Kumar, A.; Hansson, M.A; Huisken, J.; Corporaal, H.
2007-01-01
Multi-processor systems on chip (MPSoC) platforms are becoming increasingly more heterogeneous and are shifting towards a more communication-centric methodology. Networks on chip (NoC) have emerged as the design paradigm for scalable on-chip communication architectures. As the system complexity
Multi-processor system-level synthesis for multiple applications on platform FPGA
Kumar, A.; Fernando, S.D.; Ha, Y.; Mesman, B.; Corporaal, H.; Bertels, Koen
2007-01-01
Multiprocessor systems-on-chip (MPSoC) are being developed in increasing numbers to support the high number of applications running on modern embedded systems. Designing and programming such systems prove to be a major challenge. Most of the current design methodologies rely on creating the design
VME multiprocessor system for plasma control at the JT-60 Upgrade
Kimura, T.; Kurihara, K.; Takahashi, M.; Kawamata, Y.; Akasaka, H.; Matsukawa, M.
1989-01-01
In this paper design and preliminary tests are reported of a VME multiprocessor system for the JT-60 Upgrade plasma control utilizing three MC88100 based RISC computers and VME buses. The design of the VME system was stimulated by faster and more accurate computation requirements for the plasma position and shape control
Farzamian, Mohammad; Monteiro Santos, Fernando A.; Khalil, Mohamed A.
2017-12-01
The coupled hydrogeophysical approach has proved to be a valuable tool for improving the use of geoelectrical data for hydrological model parameterization. In the coupled approach, hydrological parameters are directly inferred from geoelectrical measurements in a forward manner to eliminate the uncertainty connected to the independent inversion of electrical resistivity data. Several numerical studies have been conducted to demonstrate the advantages of a coupled approach; however, only a few attempts have been made to apply the coupled approach to actual field data. In this study, we developed a 1D coupled hydrogeophysical code to estimate the van Genuchten-Mualem model parameters, K s, n, θ r and α, from time-lapse vertical electrical sounding data collected during a constant inflow infiltration experiment. van Genuchten-Mualem parameters were sampled using the Latin hypercube sampling method to provide a full coverage of the range of each parameter from their distributions. By applying the coupled approach, vertical electrical sounding data were coupled to hydrological models inferred from van Genuchten-Mualem parameter samples to investigate the feasibility of constraining the hydrological model. The key approaches taken in the study are to (1) integrate electrical resistivity and hydrological data and avoiding data inversion, (2) estimate the total water mass recovery of electrical resistivity data and consider it in van Genuchten-Mualem parameters evaluation and (3) correct the influence of subsurface temperature fluctuations during the infiltration experiment on electrical resistivity data. The results of the study revealed that the coupled hydrogeophysical approach can improve the value of geophysical measurements in hydrological model parameterization. However, the approach cannot overcome the technical limitations of the geoelectrical method associated with resolution and of water mass recovery.
Deller, A. T.; Tingay, S. J.; Bailes, M.; West, C.
2007-01-01
We describe the development of an FX style correlator for Very Long Baseline Interferometry (VLBI), implemented in software and intended to run in multi-processor computing environments, such as large clusters of commodity machines (Beowulf clusters) or computers specifically designed for high performance computing, such as multi-processor shared-memory machines. We outline the scientific and practical benefits for VLBI correlation, these chiefly being due to the inherent flexibility of softw...
A possible approach to estimating the operational efficiency of multiprocessor systems
Kuznetsov, N.Y.; Gorlach, S.P.; Sumskaya, A.A.
1984-01-01
This article presents a mathematical model that constructs the upper and lower estimates evaluating the efficiency of solution of a large class of problems using a multiprocessor system with a specific architecture. Efficiency depends on a system's architecture (e.g., the number of processors, memory volume, the number of communication links, commutation speed) and the types of problems it is intended to solve. The behavior of the model is considered in a stationary mode. The model is used to evaluate the efficiency of a particular algorithm implemented in a multiprocessor system. It is concluded that the model is flexible and enables the investigation of a broad class of problems in computational mathematics, including linear algebra and boundary-value problems of mathematical physics
Parallelising a molecular dynamics algorithm on a multi-processor workstation
Müller-Plathe, Florian
1990-12-01
The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
Molchanov, I.N.; Khimich, A.N.
1984-01-01
This article shows how a reflection method can be used to find the eigenvalues of a matrix by transforming the matrix to tridiagonal form. The method of conjugate gradients is used to find the smallest eigenvalue and the corresponding eigenvector of symmetric positive-definite band matrices. Topics considered include the computational scheme of the reflection method, the organization of parallel calculations by the reflection method, the computational scheme of the conjugate gradient method, the organization of parallel calculations by the conjugate gradient method, and the effectiveness of parallel algorithms. It is concluded that it is possible to increase the overall effectiveness of the multiprocessor electronic computers by either letting the newly available processors of a new problem operate in the multiprocessor mode, or by improving the coefficient of uniform partition of the original information
Weizhe Zhang
2014-01-01
Full Text Available Energy consumption in computer systems has become a more and more important issue. High energy consumption has already damaged the environment to some extent, especially in heterogeneous multiprocessors. In this paper, we first formulate and describe the energy-aware real-time task scheduling problem in heterogeneous multiprocessors. Then we propose a particle swarm optimization (PSO based algorithm, which can successfully reduce the energy cost and the time for searching feasible solutions. Experimental results show that the PSO-based energy-aware metaheuristic uses 40%–50% less energy than the GA-based and SFLA-based algorithms and spends 10% less time than the SFLA-based algorithm in finding the solutions. Besides, it can also find 19% more feasible solutions than the SFLA-based algorithm.
Safe and Efficient Support for Embeded Multi-Processors in ADA
Ruiz, Jose F.
2010-08-01
New software demands increasing processing power, and multi-processor platforms are spreading as the answer to achieve the required performance. Embedded real-time systems are also subject to this trend, but in the case of real-time mission-critical systems, the properties of reliability, predictability and analyzability are also paramount. The Ada 2005 language defined a subset of its tasking model, the Ravenscar profile, that provides the basis for the implementation of deterministic and time analyzable applications on top of a streamlined run-time system. This Ravenscar tasking profile, originally designed for single processors, has proven remarkably useful for modelling verifiable real-time single-processor systems. This paper proposes a simple extension to the Ravenscar profile to support multi-processor systems using a fully partitioned approach. The implementation of this scheme is simple, and it can be used to develop applications amenable to schedulability analysis.
Resource Allocation Model for Modelling Abstract RTOS on Multiprocessor System-on-Chip
Virk, Kashif Munir; Madsen, Jan
2003-01-01
Resource Allocation is an important problem in RTOS's, and has been an active area of research. Numerous approaches have been developed and many different techniques have been combined for a wide range of applications. In this paper, we address the problem of resource allocation in the context...... of modelling an abstract RTOS on multiprocessor SoC platforms. We discuss the implementation details of a simplified basic priority inheritance protocol for our abstract system model in SystemC....
Method for wiring allocation and switch configuration in a multiprocessor environment
Aridor, Yariv [Zichron Ya'akov, IL; Domany, Tamar [Kiryat Tivon, IL; Frachtenberg, Eitan [Jerusalem, IL; Gal, Yoav [Haifa, IL; Shmueli, Edi [Haifa, IL; Stockmeyer, legal representative, Robert E.; Stockmeyer, Larry Joseph [San Jose, CA
2008-07-15
A method for wiring allocation and switch configuration in a multiprocessor computer, the method including employing depth-first tree traversal to determine a plurality of paths among a plurality of processing elements allocated to a job along a plurality of switches and wires in a plurality of D-lines, and selecting one of the paths in accordance with at least one selection criterion.
Communication and Memory Architecture Design of Application-Specific High-End Multiprocessors
Yahya Jan
2012-01-01
Full Text Available This paper is devoted to the design of communication and memory architectures of massively parallel hardware multiprocessors necessary for the implementation of highly demanding applications. We demonstrated that for the massively parallel hardware multiprocessors the traditionally used flat communication architectures and multi-port memories do not scale well, and the memory and communication network influence on both the throughput and circuit area dominates the processors influence. To resolve the problems and ensure scalability, we proposed to design highly optimized application-specific hierarchical and/or partitioned communication and memory architectures through exploring and exploiting the regularity and hierarchy of the actual data flows of a given application. Furthermore, we proposed some data distribution and related data mapping schemes in the shared (global partitioned memories with the aim to eliminate the memory access conflicts, as well as, to ensure that our communication design strategies will be applicable. We incorporated these architecture synthesis strategies into our quality-driven model-based multi-processor design method and related automated architecture exploration framework. Using this framework, we performed a large series of experiments that demonstrate many various important features of the synthesized memory and communication architectures. They also demonstrate that our method and related framework are able to efficiently synthesize well scalable memory and communication architectures even for the high-end multiprocessors. The gains as high as 12-times in performance and 25-times in area can be obtained when using the hierarchical communication networks instead of the flat networks. However, for the high parallelism levels only the partitioned approach ensures the scalability in performance.
Operating system for a real-time multiprocessor propulsion system simulator
Cole, G. L.
1984-01-01
The success of the Real Time Multiprocessor Operating System (RTMPOS) in the development and evaluation of experimental hardware and software systems for real time interactive simulation of air breathing propulsion systems was evaluated. The Real Time Multiprocessor Operating System (RTMPOS) provides the user with a versatile, interactive means for loading, running, debugging and obtaining results from a multiprocessor based simulator. A front end processor (FEP) serves as the simulator controller and interface between the user and the simulator. These functions are facilitated by the RTMPOS which resides on the FEP. The RTMPOS acts in conjunction with the FEP's manufacturer supplied disk operating system that provides typical utilities like an assembler, linkage editor, text editor, file handling services, etc. Once a simulation is formulated, the RTMPOS provides for engineering level, run time operations such as loading, modifying and specifying computation flow of programs, simulator mode control, data handling and run time monitoring. Run time monitoring is a powerful feature of RTMPOS that allows the user to record all actions taken during a simulation session and to receive advisories from the simulator via the FEP. The RTMPOS is programmed mainly in PASCAL along with some assembly language routines. The RTMPOS software is easily modified to be applicable to hardware from different manufacturers.
Rajabi, Mohammad Mahdi; Ataie-Ashtiani, Behzad; Janssen, Hans
2015-02-01
The majority of literature regarding optimized Latin hypercube sampling (OLHS) is devoted to increasing the efficiency of these sampling strategies through the development of new algorithms based on the combination of innovative space-filling criteria and specialized optimization schemes. However, little attention has been given to the impact of the initial design that is fed into the optimization algorithm, on the efficiency of OLHS strategies. Previous studies, as well as codes developed for OLHS, have relied on one of the following two approaches for the selection of the initial design in OLHS: (1) the use of random points in the hypercube intervals (random LHS), and (2) the use of midpoints in the hypercube intervals (midpoint LHS). Both approaches have been extensively used, but no attempt has been previously made to compare the efficiency and robustness of their resulting sample designs. In this study we compare the two approaches and show that the space-filling characteristics of OLHS designs are sensitive to the initial design that is fed into the optimization algorithm. It is also illustrated that the space-filling characteristics of OLHS designs based on midpoint LHS are significantly better those based on random LHS. The two approaches are compared by incorporating their resulting sample designs in Monte Carlo simulation (MCS) for uncertainty propagation analysis, and then, by employing the sample designs in the selection of the training set for constructing non-intrusive polynomial chaos expansion (NIPCE) meta-models which subsequently replace the original full model in MCSs. The analysis is based on two case studies involving numerical simulation of density dependent flow and solute transport in porous media within the context of seawater intrusion in coastal aquifers. We show that the use of midpoint LHS as the initial design increases the efficiency and robustness of the resulting MCSs and NIPCE meta-models. The study also illustrates that this
Westmijze, M.; Bekooij, Marco Jan Gerrit; Smit, Gerardus Johannes Maria
2013-01-01
Systems with chip multi-processors are currently used for several applications that have real-time requirements. In chip multi-processor architectures, many hardware resources such as parts of the cache hierarchy are shared between cores and by using such resources, applications can significantly
Al-Hallaq, A.; Amin, S.
1998-01-01
This paper introduces a new parallel algorithm and its simulation on a hypercube simulator for the low pass digital image filtering using a systolic array. This new algorithm is faster than the old one (Amin, 1988). This is due to the the fact that the old algorithm carries out the addition operations in a sequential mode. But in our new design these addition operations are divided into tow groups, which can be performed in parallel. One group will be performed on one half of the systolic array and the other on the second half, that is, by folding. This parallelism reduces the time required for the whole process by almost quarter the time of the old algorithm.(authors). 18 refs., 3 figs
Hamalainen, Sampsa; Geng, Xiaoyuan; He, Juanxia
2017-04-01
Latin Hypercube Sampling (LHS) at variable resolutions for enhanced watershed scale Soil Sampling and Digital Soil Mapping. Sampsa Hamalainen, Xiaoyuan Geng, and Juanxia, He. AAFC - Agriculture and Agr-Food Canada, Ottawa, Canada. The Latin Hypercube Sampling (LHS) approach to assist with Digital Soil Mapping has been developed for some time now, however the purpose of this work was to complement LHS with use of multiple spatial resolutions of covariate datasets and variability in the range of sampling points produced. This allowed for specific sets of LHS points to be produced to fulfil the needs of various partners from multiple projects working in the Ontario and Prince Edward Island provinces of Canada. Secondary soil and environmental attributes are critical inputs that are required in the development of sampling points by LHS. These include a required Digital Elevation Model (DEM) and subsequent covariate datasets produced as a result of a Digital Terrain Analysis performed on the DEM. These additional covariates often include but are not limited to Topographic Wetness Index (TWI), Length-Slope (LS) Factor, and Slope which are continuous data. The range of specific points created in LHS included 50 - 200 depending on the size of the watershed and more importantly the number of soil types found within. The spatial resolution of covariates included within the work ranged from 5 - 30 m. The iterations within the LHS sampling were run at an optimal level so the LHS model provided a good spatial representation of the environmental attributes within the watershed. Also, additional covariates were included in the Latin Hypercube Sampling approach which is categorical in nature such as external Surficial Geology data. Some initial results of the work include using a 1000 iteration variable within the LHS model. 1000 iterations was consistently a reasonable value used to produce sampling points that provided a good spatial representation of the environmental
Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2
Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad
1995-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 M ops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32-nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32 node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.
Ohmacht, Martin
2017-08-15
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Jozwiak, Lech; Madsen, Jan
2013-01-01
The recent spectacular progress in modern nano-dimension semiconductor technology enabled implementation of a complete complex multi-processor system on a single chip (MPSoC), global networking and mobile wire-less communication, and facilitated a fast progress in these areas. New important...... accessible or distant) objects, installations, machines or devices, or even implanted in human or animal body can serve as examples. However, many of the modern embedded application impose very stringent functional and parametric demands. Moreover, the spectacular advances in microelectronics introduced...
ARTiS, an Asymmetric Real-Time Scheduler for Linux on Multi-Processor Architectures
Piel , Éric; Marquet , Philippe; Soula , Julien; Osuna , Christophe; Dekeyser , Jean-Luc
2005-01-01
The ARTiS system is a real-time extension of the GNU/Linux scheduler dedicated to SMP (Symmetric Multi-Processors) systems. It allows to mix High Performance Computing and real-time. ARTiS exploits the SMP architecture to guarantee the preemption of a processor when the system has to schedule a real-time task. The implementation is available as a modification of the Linux kernel, especially focusing (but not restricted to) IA-64 architecture. The basic idea of ARTiS is to assign a selected se...
A high speed multi-tasking, multi-processor telemetry system
Wu, Kung Chris [Univ. of Texas, El Paso, TX (United States)
1996-12-31
This paper describes a small size, light weight, multitasking, multiprocessor telemetry system capable of collecting 32 channels of differential signals at a sampling rate of 6.25 kHz per channel. The system is designed to collect data from remote wind turbine research sites and transfer the data via wireless communication. A description of operational theory, hardware components, and itemized cost is provided. Synchronization with other data acquisition systems and test data on data transmission rates is also given. 11 refs., 7 figs., 4 tabs.
Energy-efficient fault tolerance in multiprocessor real-time systems
Guo, Yifeng
The recent progress in the multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4 -- 8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in near future. For such systems, in addition to general temporal requirement common for all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP). A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is
Ohmacht, Martin
2014-09-09
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Multiprocessor based data acquisition system for radiation monitoring in nuclear reactors
Pansare, M.G.; Narsaiah, A.; Anantha Krishnan, T.S.
1989-01-01
Expensive minicomputers are required for building powerful Data Acquisition Systems (DAS) capable of scanning and processing large number of signals in a real-time environment. However by using the inexpensive microprocessors in multiprocessor configuration it is possible to build DASs that are as powerful as minicomputer based systems at much lesser cost. This paper describes such a multiprocessor based DAS designed for acquiring data from various radiation monitoring instruments of a nuclear reactor. The system is built by using MULTIBUS standard boards based on intel 8086, 16 bit microprocessor, with local and shared memory. The system monitors upto 128 analog input channels, 64 digital input channels and actuates upto 128 digital output contacts. The system continuously checks for the alarm condition of the input channels and displays the alarm status on an ALARM CRT. Facility has been provided for the transfer of data to a central computer. At any instant of time, the information regarding different channels being monitored is available from the local console as well as through five remote terminals located at various places in the reactor building. (author)
Development of a VME multi-processor system for plasma control at the JT-60 Upgrade
Takahashi, M.; Kurihara, K.; Kawamata, Y.; Akasaka, H.; Kimura, T.
1992-01-01
Design and initial operation results are reported of a VME multi-processor system [1] for plasma control at a large fusion device named 'the JT-60 Upgrade' utilizing three 32-bit MC88100 based RISC computers and VME components. Development of the system was stimulated by faster and more accurate computation requirements for the plasma position and current control. The RISC computers operate at 25 MHz along with two cashe memories named MC88200. We newly developed VME bus modules of up/down counter, analog-to-digital converter and clock pulse generator for measuring magnetic field and coil current and for synchronizing the processing in the three RISCs and direct digital controllers (DDCs) of magnet power supplies. We also evaluated that the speed of the data transfer between the VME bus system and the DDCs through CAMAC highways satisfies the above requirements. In the initial operation of the JT-60 upgrade, it has been proved that the VME multi-processor system well controls the plasma position and current with a sampling period of 250 μsec and a delay of 500 μsec. (author)
Quality-driven model-based design of multi-processor accelerators : an application to LDPC decoders
Jan, Y.
2012-01-01
The recent spectacular progress in nano-electronic technology has enabled the implementation of very complex multi-processor systems on single chips (MPSoCs). However in parallel, new highly demanding complex embedded applications are emerging, in fields like communication and networking,
A. Vandevelde; J.A. Hoogeveen; C.A.J. Hurkens (Cor); J.K. Lenstra (Jan Karel)
2005-01-01
htmlabstractThe multiprocessor flow-shop is the generalization of the flow-shop in which each machine is replaced by a set of identical machines. As finding a minimum-length schedule is NP-hard, we set out to find good lower and upper bounds. The lower bounds are based on relaxation of the
Vandevelde, A.; Hoogeveen, J.A.; Hurkens, C.A.J.; Lenstra, J.K.
2005-01-01
The multiprocessor flow-shop is the generalization of the flow-shop in which each machine is replaced by a set of identical machines. As finding a minimum-length schedule is NP-hard, we set out to find good lower and upper bounds. The lower bounds are based on relaxation of the capacities of all
WANG, P. T.
2015-12-01
Groundwater modeling requires to assign hydrogeological properties to every numerical grid. Due to the lack of detailed information and the inherent spatial heterogeneity, geological properties can be treated as random variables. Hydrogeological property is assumed to be a multivariate distribution with spatial correlations. By sampling random numbers from a given statistical distribution and assigning a value to each grid, a random field for modeling can be completed. Therefore, statistics sampling plays an important role in the efficiency of modeling procedure. Latin Hypercube Sampling (LHS) is a stratified random sampling procedure that provides an efficient way to sample variables from their multivariate distributions. This study combines the the stratified random procedure from LHS and the simulation by using LU decomposition to form LULHS. Both conditional and unconditional simulations of LULHS were develpoed. The simulation efficiency and spatial correlation of LULHS are compared to the other three different simulation methods. The results show that for the conditional simulation and unconditional simulation, LULHS method is more efficient in terms of computational effort. Less realizations are required to achieve the required statistical accuracy and spatial correlation.
Multi-processor system for real-time deconvolution and flow estimation in medical ultrasound
Jensen, Jesper Lomborg; Jensen, Jørgen Arendt; Stetson, Paul F.
1996-01-01
of the algorithms. Many of the algorithms can only be properly evaluated in a clinical setting with real-time processing, which generally cannot be done with conventional equipment. This paper therefore presents a multi-processor system capable of performing 1.2 billion floating point operations per second on RF...... filter is used with a second time-reversed recursive estimation step. Here it is necessary to perform about 70 arithmetic operations per RF sample or about 1 billion operations per second for real-time deconvolution. Furthermore, these have to be floating point operations due to the adaptive nature...... interfaced to our previously-developed real-time sampling system that can acquire RF data at a rate of 20 MHz and simultaneously transmit the data at 20 MHz to the processing system via several parallel channels. These two systems can, thus, perform real-time processing of ultrasound data. The advantage...
2-D fluid dynamics models for laser driven fusion on IBM 3090 vector multiprocessors
Atzeni, S.
1988-01-01
Fluid-dynamics codes for laser fusion are complex research codes, consisting of many distinct modules and embodying a variety of numerical methods. They are therefore good candidates for testing general purpose advanced computer architectures and the related software. In this paper, after a brief outline of the basic concepts of laser fusion, the implementation of the 2-D laser fusion fluid code DUED on the IBM 3090 VF vector multiprocessors is discussed. Emphasis is put on parallelization, performed by means of IBM Parallel FORTRAN (PF). It is shown how different modules have been optimized by using different features of PF: i) modules based on depth-2 nested loops exploit automatic parallelization; ii) laser light ray tracing is partitioned by scheduling parallel ICCG algorithm (executed in parallel by appropiately synchronized parallel subroutines). Performance results are given for separate modules of the code, as well as for typical complete runs
Commodity multi-processor systems in the ATLAS level-2 trigger
Abolins, M.; Blair, R.; Bock, R.; Bogaerts, A.; Dawson, J.; Ermoline, Y.; Hauser, R.; Kugel, A.; Lay, R.; Muller, M.; Noffz, K.-H.; Pope, B.; Schlereth, J.; Werner, P.
2000-01-01
Low cost SMP (Symmetric Multi-Processor) systems provide substantial CPU and I/O capacity. These features together with the ease of system integration make them an attractive and cost effective solution for a number of real-time applications in event selection. In ATLAS the authors consider them as intelligent input buffers (active ROB complex), as event flow supervisors or as powerful processing nodes. Measurements of the performance of one off-the-shelf commercial 4-processor PC with two PCI buses, equipped with commercial FPGA based data source cards (microEnable) and running commercial software are presented and mapped on such applications together with a long-term program of work. The SMP systems may be considered as an important building block in future data acquisition systems
Multi-processor data acquisition and monitoring systems for particle physics
White, V.; Burch, B.; Eng, K.; Heinicke, P.; Pyatetsky, M.; Ritchie, D.
1983-01-01
A high speed distributed processing system, using PDP-11 and VAX processors, is being developed at Fermilab. The acquisition of data is done using one or more PDP-11s. Additional processors are connected to provide either data logging or extra data analysis capabilities. Within this framework, functional interchangeability of PDP-11 and VAX processors and of the PDP-11 operating systems, RT-11 and RSX-11M, has been maintained. Inter-processor connections have been implemented in a general way using the 5 megabit DR11-W hardware currently selected for the purpose. Using this approach the authors have been able to make use of several existing data acquisition and analysis packages, such as RT/MULTI, in a multi-processor system
Commodity multi-processor systems in the ATLAS level-2 trigger
Abolins, M; Bock, R; Bogaerts, J A C; Dawson, J; Ermoline, Y; Hauser, R; Kugel, A; Lay, R; Müller, M; Noffz, K H; Pope, B; Schlereth, J L; Werner, P
2000-01-01
Low cost SMP (symmetric multi-processor) systems provide substantial CPU and I/O capacity. These features together with the ease of system integration make them an attractive and cost effective solution for a number of real-time applications in event selection. In ATLAS we consider them as intelligent input buffers (an "active" ROB complex), as event flow supervisors or as powerful processing nodes. Measurements of the performance of one off-the-shelf commercial 4- processor PC with two PCI buses, equipped with commercial FPGA based data source cards (microEnable) and running commercial software are presented and mapped on such applications together with a long-term programme of work. The SMP systems may be considered as an important building block in future data acquisition systems. (9 refs).
Job-mix modeling and system analysis of an aerospace multiprocessor.
Mallach, E. G.
1972-01-01
An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.
Kutt, P.H.; Balamuth, D.P.
1989-01-01
A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPU's and 14 MB of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over 3 times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing for gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC processors to obtain an order of magnitude increase in computing power per node
A Survey of Rollback-Recovery Protocols in Message-Passing Systems
1999-06-01
and M.A. Castillo. "Checkpointing through garbage collection." Technical report. Departamento de Ciencia de la Computation, Escuela de Ingenieria ...between consecutive checkpoints. It can be implemented by using the dirty-bit of the memory protection hardware or by emulating a dirty-bit in software [4...compare the program’s state with the previous checkpoint in software , and writing the difference in a new checkpoint [46]. The required storage and
Scalable High Performance Message Passing over InfiniBand for Open MPI
Friedley, A; Hoefler, T; Leininger, M L; Lumsdaine, A
2007-10-24
InfiniBand (IB) is a popular network technology for modern high-performance computing systems. MPI implementations traditionally support IB using a reliable, connection-oriented (RC) transport. However, per-process resource usage that grows linearly with the number of processes, makes this approach prohibitive for large-scale systems. IB provides an alternative in the form of a connectionless unreliable datagram transport (UD), which allows for near-constant resource usage and initialization overhead as the process count increases. This paper describes a UD-based implementation for IB in Open MPI as a scalable alternative to existing RC-based schemes. We use the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI. Results show that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times. A connectionless model also improves performance for applications that tend to send small messages to many different processes.
Message-Passing Receiver for OFDM Systems over Highly Delay-Dispersive Channels
Barbu, Oana-Elena; Manchón, Carles Navarro; Rom, Christian
2017-01-01
Propagation channels with maximum excess delay exceeding the duration of the cyclic prefix (CP) in OFDM systems cause intercarrier and intersymbol interference which, unless accounted for, degrade the receiver performance. Using tools from Bayesian inference and sparse signal reconstruction, we...... derive an iterative algorithm that estimates an approximate representation of the channel impulse response and the noise variance, estimates and cancels the intrinsic interference and decodes the data over a block of symbols. Simulation results show that the receiver employing our algorithm outperforms...
A model based message passing approach for flexible and scalable home automation controllers
Bienhaus, D. [INNIAS GmbH und Co. KG, Frankenberg (Germany); David, K.; Klein, N.; Kroll, D. [ComTec Kassel Univ., SE Kassel Univ. (Germany); Heerdegen, F.; Jubeh, R.; Zuendorf, A. [Kassel Univ. (Germany). FG Software Engineering; Hofmann, J. [BSC Computer GmbH, Allendorf (Germany)
2012-07-01
There is a large variety of home automation systems that are largely proprietary systems from different vendors. In addition, the configuration and administration of home automation systems is frequently a very complex task especially, if more complex functionality shall be achieved. Therefore, an open model for home automation was developed that is especially designed for easy integration of various home automation systems. This solution also provides a simple modeling approach that is inspired by typical home automation components like switches, timers, etc. In addition, a model based technology to achieve rich functionality and usability was implemented. (orig.)
Shared memory and message passing revisited in the many-core era
CERN. Geneva
2016-01-01
In the 70s, Edsgar Dijkstra, Per Brinch Hansen and C.A.R Hoare introduced the fundamental concepts for concurrent computing. It was clear that concrete communication mechanisms were required in order to achieve effective concurrency. Whether you're developing a multithreaded program running on a single node, or a distributed system spanning over hundreds of thousands cores, the choice of the communication mechanism for your system must be done intelligently because of the implicit programmability, performance and scalability trade-offs. With the emergence of many-core computing architectures many assumptions may not be true anymore. In this talk we will try to provide insight on the characteristics of these communication models by providing basic theoretical background and then focus on concrete practical examples based on indicative use case scenarios. The case studies of this presentation cover popular programming models, operating systems and concurrency frameworks in the context of many-core processors.
Solchenbach, K.; Thole, C.A.; Trottenberg, U.
1987-01-01
For a wide class of problems in scientific computing, in particular for partial differential equations, the multigrid principle has proved to yield highly efficient numerical methods. However, the principle has to be applied carefully: if the multigrid components are not chosen adequately with respect to the given problem, the efficiency may be much smaller than possible. This has been demonstrated for many practical problems. Unfortunately, the general theories on multigrid convergence do not give much help in constructing really efficient multigrid algorithms. Although some progress has been made in bridging the gap between theory and practice during the last few years, there are still several theoretical approaches which are misleading rather than helpful with respect to the objective of real efficiency. The research in finding highly efficient algorithms for non-model applications therefore is still a sophisticated mixture of theoretical considerations, a transfer of experiences from model to real life problems and systematical experimental work. The emphasis of the practical research activity today lies - among others - in the following fields: - finding efficient multigrid components for really complex problems, - combining the multigrid approach with advanced discretizative techniques: - constructing highly parallel multigrid algorithms. In this paper, we want to deal mainly with the last topic
Bradley, D. B.; Irwin, J. D.
1974-01-01
A computer simulation model for a multiprocessor computer is developed that is useful for studying the problem of matching multiprocessor's memory space, memory bandwidth and numbers and speeds of processors with aggregate job set characteristics. The model assumes an input work load of a set of recurrent jobs. The model includes a feedback scheduler/allocator which attempts to improve system performance through higher memory bandwidth utilization by matching individual job requirements for space and bandwidth with space availability and estimates of bandwidth availability at the times of memory allocation. The simulation model includes provisions for specifying precedence relations among the jobs in a job set, and provisions for specifying precedence execution of TMR (Triple Modular Redundant and SIMPLEX (non redundant) jobs.
Use of a genetic algorithm to solve two-fluid flow problems on an NCUBE multiprocessor computer
Pryor, R.J.; Cline, D.D.
1992-01-01
A method of solving the two-phase fluid flow equations using a genetic algorithm on a NCUBE multiprocessor computer is presented. The topics discussed are the two-phase flow equations, the genetic representation of the unknowns, the fitness function, the genetic operators, and the implementation of the algorithm on the NCUBE computer. The efficiency of the implementation is investigated using a pipe blowdown problem. Effects of varying the genetic parameters and the number of processors are presented
Use of a genetic agorithm to solve two-fluid flow problems on an NCUBE multiprocessor computer
Pryor, R.J.; Cline, D.D.
1993-01-01
A method of solving the two-phases fluid flow equations using a genetic algorithm on a NCUBE multiprocessor computer is presented. The topics discussed are the two-phase flow equations, the genetic representation of the unkowns, the fitness function, the genetic operators, and the implementation of the algorithm on the NCUBE computer. The efficiency of the implementation is investigated using a pipe blowdown problem. Effects of varying the genetic parameters and the number of processors are presented. (orig.)
Gaines, I.
1988-03-01
The use of multi-processors for analysis and high-level triggering in High Energy Physics experiments, pioneered by the early emulator systems, has reached maturity, in particular with the multiple microprocessor systems in use at Fermilab. It is widely acknowledged that such systems will fulfill the major portion of the computing needs of future large experiments. Recent developments at Fermilab's Advanced Computer Program will make such systems even more powerful, cost-effective, and easier to use than they are at present. The next generation of microprocessors, already available, will provide CPU power of about one VAX 780 equivalent/$300, while supporting most VMS FORTRAN extensions and large (>8MB) amounts of memory. Low cost high density mass storage devices (based on video tape cartridge technology) will allow parallel I/O to remove potential I/O bottlenecks in systems of over 1000 VAX equipment processors. New interconnection schemes and system software will allow more flexible topologies and extremely high data bandwidth, especially for on-line systems. This talk will summarize the work at the Advanced Computer Program and the rest of the US in this field. 3 refs., 4 figs
Compilation time analysis to minimize run-time overhead in preemptive scheduling on multiprocessors
Wauters, Piet; Lauwereins, Rudy; Peperstraete, J.
1994-10-01
This paper describes a scheduling method for hard real-time Digital Signal Processing (DSP) applications, implemented on a multi-processor. Due to the very high operating frequencies of DSP applications (typically hundreds of kHz) runtime overhead should be kept as small as possible. Because static scheduling introduces very little run-time overhead it is used as much as possible. Dynamic pre-emption of tasks is allowed if and only if it leads to better performance in spite of the extra run-time overhead. We essentially combine static scheduling with dynamic pre-emption using static priorities. Since we are dealing with hard real-time applications we must be able to guarantee at compile-time that all timing requirements will be satisfied at run-time. We will show that our method performs at least as good as any static scheduling method. It also reduces the total amount of dynamic pre-emptions compared with run time methods like deadline monotonic scheduling.
Zilker, M.; Hallatschek, K.; Heimann, P.; Hertweck, F.
1999-01-01
In this paper we present our transputer-based multitop multiprocessor systems for data acquisition, which are currently used on the Asdex upgrade experiment. The bandwidth of these systems goes from low-speed like the calorimetry diagnostic up to highspeed and large data volume systems like the soft-X-ray and Mirnov diagnostics, which collect several hundreds of megabytes of data during a plasma discharge of ∼8 s. Further, we present the multitop-MX, a newly developed system based on transputers and powerPCs, which provides real-time facilities for analysing the acquired data, to generate necessary information for the dynamic adaptation of sample rates, and to deliver triggers when certain events in the plasma are detected. The algorithm running on the powerPCs performs a wavelet like time-frequency transform. In the last part we give an outlook how to build the next generation of data acquisition systems to be used on the future plasma experiments W7-X and ITER, but also on Asdex upgrade. The hardware of these new distributed systems should be mainly based on established industry standards like the VME-bus, PCI-bus and FiberChannel, but also emerging technologies like SCI (scalable coherent interconnect) should be considered. The systems software should be well designed with object oriented methods to simplify the maintenance process and to enable further expansions and adaptations to new problems in an easy way. (orig.)
Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture
Yassin El-Hillali
2009-01-01
Full Text Available This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC architecture optimized for Multiple Target Tracking (MTT in automotive applications. An MTT system uses an automotive radar to track the speed and relative position of all the vehicles (targets within its field of view. As the number of targets increases, the computational needs of the MTT system also increase making it difficult for a single processor to handle it alone. Our implementation distributes the computational load among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how we designed and profiled the MTT application to partition it among different processors. It also explains how we applied different optimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the application's real-time constraints.
Scheduling for energy and reliability management on multiprocessor real-time systems
Qi, Xuan
Scheduling algorithms for multiprocessor real-time systems have been studied for years with many well-recognized algorithms proposed. However, it is still an evolving research area and many problems remain open due to their intrinsic complexities. With the emergence of multicore processors, it is necessary to re-investigate the scheduling problems and design/develop efficient algorithms for better system utilization, low scheduling overhead, high energy efficiency, and better system reliability. Focusing cluster schedulings with optimal global schedulers, we study the utilization bound and scheduling overhead for a class of cluster-optimal schedulers. Then, taking energy/power consumption into consideration, we developed energy-efficient scheduling algorithms for real-time systems, especially for the proliferating embedded systems with limited energy budget. As the commonly deployed energy-saving technique (e.g. dynamic voltage frequency scaling (DVFS)) will significantly affect system reliability, we study schedulers that have intelligent mechanisms to recuperate system reliability to satisfy the quality assurance requirements. Extensive simulation is conducted to evaluate the performance of the proposed algorithms on reduction of scheduling overhead, energy saving, and reliability improvement. The simulation results show that the proposed reliability-aware power management schemes could preserve the system reliability while still achieving substantial energy saving.
E-Token Energy-Aware Proportionate Sharing Scheduling Algorithm for Multiprocessor Systems
Pasupuleti Ramesh
2017-01-01
Full Text Available WSN plays vital role from small range healthcare surveillance systems to largescale environmental monitoring. Its design for energy constrained applications is a challenging issue. Sensors in WSNs are projected to run separately for longer periods. It is of excessive cost to substitute exhausted batteries which is not even possible in antagonistic situations. Multiprocessors are used in WSNs for high performance scientific computing, where each processor is assigned the same or different workload. When the computational demands of the system increase then the energy efficient approaches play an important role to increase system lifetime. Energy efficiency is commonly carried out by using proportionate fair scheduler. This introduces abnormal overloading effect. In order to overcome the existing problems E-token Energy-Aware Proportionate Sharing (EEAPS scheduling is proposed here. The power consumption for each thread/task is calculated and the tasks are allotted to the multiple processors through the auctioning mechanism. The algorithm is simulated by using the real-time simulator (RTSIM and the results are tested.
A Taxonomy of Reconfigurable Single-/Multiprocessor Systems-on-Chip
Diana Göhringer
2009-01-01
Full Text Available Runtime adaptivity of hardware in processor architectures is a novel trend, which is under investigation in a variety of research labs all over the world. The runtime exchange of modules, implemented on a reconfigurable hardware, affects the instruction flow (e.g., in reconfigurable instruction set processors or the data flow, which has a strong impact on the performance of an application. Furthermore, the choice of a certain processor architecture related to the class of target applications is a crucial point in application development. A simple example is the domain of high-performance computing applications found in meteorology or high-energy physics, where vector processors are the optimal choice. A classification scheme for computer systems was provided in 1966 by Flynn where single/multiple data and instruction streams were combined to four types of architectures. This classification is now used as a foundation for an extended classification scheme including runtime adaptivity as further degree of freedom for processor architecture design. The developed scheme is validated by a multiprocessor system implemented on reconfigurable hardware as well as by a classification of existing static and reconfigurable processor systems.
Lin, Yu-Pin; Chu, Hone-Jay; Wang, Cheng-Long; Yu, Hsiao-Hsuan; Wang, Yung-Chieh
2009-01-01
This study applies variogram analyses of normalized difference vegetation index (NDVI) images derived from SPOT HRV images obtained before and after the ChiChi earthquake in the Chenyulan watershed, Taiwan, as well as images after four large typhoons, to delineate the spatial patterns, spatial structures and spatial variability of landscapes caused by these large disturbances. The conditional Latin hypercube sampling approach was applied to select samples from multiple NDVI images. Kriging and sequential Gaussian simulation with sufficient samples were then used to generate maps of NDVI images. The variography of NDVI image results demonstrate that spatial patterns of disturbed landscapes were successfully delineated by variogram analysis in study areas. The high-magnitude Chi-Chi earthquake created spatial landscape variations in the study area. After the earthquake, the cumulative impacts of typhoons on landscape patterns depended on the magnitudes and paths of typhoons, but were not always evident in the spatiotemporal variability of landscapes in the study area. The statistics and spatial structures of multiple NDVI images were captured by 3,000 samples from 62,500 grids in the NDVI images. Kriging and sequential Gaussian simulation with the 3,000 samples effectively reproduced spatial patterns of NDVI images. However, the proposed approach, which integrates the conditional Latin hypercube sampling approach, variogram, kriging and sequential Gaussian simulation in remotely sensed images, efficiently monitors, samples and maps the effects of large chronological disturbances on spatial characteristics of landscape changes including spatial variability and heterogeneity.
Lin, Yu-Pin; Chu, Hone-Jay; Huang, Yu-Long; Tang, Chia-Hsi; Rouhani, Shahrokh
2011-06-01
This study develops a stratified conditional Latin hypercube sampling (scLHS) approach for multiple, remotely sensed, normalized difference vegetation index (NDVI) images. The objective is to sample, monitor, and delineate spatiotemporal landscape changes, including spatial heterogeneity and variability, in a given area. The scLHS approach, which is based on the variance quadtree technique (VQT) and the conditional Latin hypercube sampling (cLHS) method, selects samples in order to delineate landscape changes from multiple NDVI images. The images are then mapped for calibration and validation by using sequential Gaussian simulation (SGS) with the scLHS selected samples. Spatial statistical results indicate that in terms of their statistical distribution, spatial distribution, and spatial variation, the statistics and variograms of the scLHS samples resemble those of multiple NDVI images more closely than those of cLHS and VQT samples. Moreover, the accuracy of simulated NDVI images based on SGS with scLHS samples is significantly better than that of simulated NDVI images based on SGS with cLHS samples and VQT samples, respectively. However, the proposed approach efficiently monitors the spatial characteristics of landscape changes, including the statistics, spatial variability, and heterogeneity of NDVI images. In addition, SGS with the scLHS samples effectively reproduces spatial patterns and landscape changes in multiple NDVI images.
Romanuke Vadim
2015-07-01
Full Text Available The paper suggests a method of obtaining an approximate solution of the infinite noncooperative game on the unit hypercube. The method is based on sampling uniformly the players’ payoff functions with the constant step along each of the hypercube dimensions. The author states the conditions for a sufficiently accurate sampling and suggests the method of reshaping the multidimensional matrix of the player’s payoff values, being the former player’s payoff function before its sampling, into a matrix with minimally possible number of dimensions, where also maintenance of one-to-one indexing has been provided. Requirements for finite NE-strategy from NE (Nash equilibrium solution of the finite game as the initial infinite game approximation are given as definitions of the approximate solution consistency. The approximate solution consistency ensures its relative independence upon the sampling step within its minimal neighborhood or the minimally decreased sampling step. The ultimate reshaping of multidimensional matrices of players’ payoff values to the minimal number of dimensions, being equal to the number of players, stimulates shortened computations.
Use of the CAMAC-MULTIBUS combined protocol for organizing multi-processor operation in a crate
Glejbman, Eh.M.
1985-01-01
Problems of developing electronic units for large on-line systems for nuclear-physical experiments automation and developed on the base of principles of distributed control and data processing are discussed. Crates with simultaneous disposition and operation of CAMAC moduli (EUR-4100) and those realizing the MULTIBUS hardcopy log in dataway are described. It is attained due to sharing the CAMAC and the MULTIBUS hardcopy logs in the crate dataway. Application of job scheduler and executor moduli in the MULTIBUS interface permits to organize multiprocessor operation and to obtain separation of data stream as well as to increase total computational capacity in the crate
Biedermann, Alexander
2014-01-01
Alexander Biedermann presents a generic hardware-based virtualization approach, which may transform an array of any off-the-shelf embedded processors into a multi-processor system with high execution dynamism. Based on this approach, he highlights concepts for the design of energy aware systems, self-healing systems as well as parallelized systems. For the latter, the novel so-called Agile Processing scheme is introduced by the author, which enables a seamless transition between sequential and parallel execution schemes. The design of such virtualizable systems is further aided by introduction
Hammer, C.; Paffrath, M.; Boeer, R.; Finnemann, H.; Jackson, C.J.
1996-01-01
The light water reactor core simulation code PANBOX has been coupled with the transient analysis code RELAP5 for the purpose of performing plant safety analyses with a three-dimensional (3-D) neutron kinetics model. The system has been parallelized to improve the computational efficiency. The paper describes the features of this system with emphasis on performance aspects. Performance results are given for different types of parallelization, i. e. for using an automatic parallelizing compiler, using the portable PVM platform on a workstation cluster, using PVM on a shared memory multiprocessor, and for using machine dependent interfaces. (author)
Design of Networks-on-Chip for Real-Time Multi-Processor Systems-on-Chip
Sparsø, Jens
2012-01-01
This paper addresses the design of networks-on-chips for use in multi-processor systems-on-chips - the hardware platforms used in embedded systems. These platforms typically have to guarantee real-time properties, and as the network is a shared resource, it has to provide service guarantees...... (bandwidth and/or latency) to different communication flows. The paper reviews some past work in this field and the lessons learned, and the paper discusses ongoing research conducted as part of the project "Time-predictable Multi-Core Architecture for Embedded Systems" (T-CREST), supported by the European...
Popovici, Katalin; Jerraya, Ahmed A; Wolf, Marilyn
2010-01-01
Current multimedia and telecom applications require complex, heterogeneous multiprocessor system on chip (MPSoC) architectures with specific communication infrastructure in order to achieve the required performance. Heterogeneous MPSoC includes different types of processing units (DSP, microcontroller, ASIP) and different communication schemes (fast links, non standard memory organization and access).Programming an MPSoC requires the generation of efficient software running on MPSoC from a high level environment, by using the characteristics of the architecture. This task is known to be tediou
Curchitser, E.N.; Pelz, R.B.; Marconi, F.
1992-01-01
The Euler and Navier-Stokes equations are solved for the steady, two-dimensional flow over a NACA 0012 airfoil using a 1024 node nCUBE/2 multiprocessor. Second-order, upwind-discretized difference equations are solved implicitly using ADI factorization. Parallel cyclic reduction is employed to solve the block tridiagonal systems. For realistic problems, communication times are negligible compared to calculation times. The processors are tightly synchronized, and their loads are well balanced. When the flux Jacobians flux are frozen, the wall-clock time for one implicit timestep is about equal to that of a multistage explicit scheme. 10 refs
Parallel-vector algorithms for particle simulations on shared-memory multiprocessors
Nishiura, Daisuke; Sakaguchi, Hide
2011-01-01
Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles freely move within a given space, and so on a distributed-memory system, load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. However, shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) paring of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by a cell label in the domain to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the latter problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, scalar supercomputer, vector supercomputer, and graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.
Thimmaiah, Tim; Voje, William E; Carothers, James M
2015-01-01
With progress toward inexpensive, large-scale DNA assembly, the demand for simulation tools that allow the rapid construction of synthetic biological devices with predictable behaviors continues to increase. By combining engineered transcript components, such as ribosome binding sites, transcriptional terminators, ligand-binding aptamers, catalytic ribozymes, and aptamer-controlled ribozymes (aptazymes), gene expression in bacteria can be fine-tuned, with many corollaries and applications in yeast and mammalian cells. The successful design of genetic constructs that implement these kinds of RNA-based control mechanisms requires modeling and analyzing kinetically determined co-transcriptional folding pathways. Transcript design methods using stochastic kinetic folding simulations to search spacer sequence libraries for motifs enabling the assembly of RNA component parts into static ribozyme- and dynamic aptazyme-regulated expression devices with quantitatively predictable functions (rREDs and aREDs, respectively) have been described (Carothers et al., Science 334:1716-1719, 2011). Here, we provide a detailed practical procedure for computational transcript design by illustrating a high throughput, multiprocessor approach for evaluating spacer sequences and generating functional rREDs. This chapter is written as a tutorial, complete with pseudo-code and step-by-step instructions for setting up a computational cluster with an Amazon, Inc. web server and performing the large numbers of kinefold-based stochastic kinetic co-transcriptional folding simulations needed to design functional rREDs and aREDs. The method described here should be broadly applicable for designing and analyzing a variety of synthetic RNA parts, devices and transcripts.
Russo, Vincent; Johnston, Gary; Campbell, Roy
1988-01-01
The programming of the interrupt handling mechanisms, process switching primitives, scheduling mechanism, and synchronization primitives of an operating system for a multiprocessor require both efficient code in order to support the needs of high- performance or real-time applications and careful organization to facilitate maintenance. Although many advantages have been claimed for object-oriented class hierarchical languages and their corresponding design methodologies, the application of these techniques to the design of the primitives within an operating system has not been widely demonstrated. To investigate the role of class hierarchical design in systems programming, the authors have constructed the Choices multiprocessor operating system architecture the C++ programming language. During the implementation, it was found that many operating system design concerns can be represented advantageously using a class hierarchical approach, including: the separation of mechanism and policy; the organization of an operating system into layers, each of which represents an abstract machine; and the notions of process and exception management. In this paper, we discuss an implementation of the low-level primitives of this system and outline the strategy by which we developed our solution.
Petelet, Matthieu; Asserin, Olivier; Iooss, Bertrand; Petelet, Matthieu; Loredo, Alexandre
2006-01-01
In this work, the method of sensitivity analysis allowing to identify the inlet data the most influential on the variability of the responses (residual stresses and distortions). Classically, the sensitivity analysis is carried out locally what limits its validity domain to a given material. A global sensitivity analysis method is proposed; it allows to cover a material domain as wide as those of the steels series. A probabilistic modeling giving the variability of the material parameters in the steels series is proposed. The original aspect of this work consists in the use of the sampling method by latin hypercubes (LHS) of the material parameters which forms the inlet data (dependent of temperature) of the numerical simulations. Thus, a statistical approach has been applied to the welding numerical simulation: LHS sampling of the material properties, global sensitivity analysis what has allowed the reduction of the material parameterization. (O.M.)
Kovarik, M.D.; Barnes, T.; Tennessee Univ., Knoxville, TN
1993-01-01
We describe a Monte Carlo simulation of a dynamical fermion problem in two spatial dimensions on an Intel iPSC/860 hypercube. The problem studied is the determination of the dispersion relation of a dynamical hole in the t-J model of the high temperature superconductors. Since this problem involves the motion of many fermions in more than one spatial dimensions, it is representative of the class of systems that suffer from the ''minus sign problem'' of dynamical fermions which has made Monte Carlo simulation very difficult. We demonstrate that for small values of the hole hopping parameter one can extract the entire hole dispersion relation using the GRW Monte Carlo algorithm, which is a simulation of the Euclidean time Schroedinger equation, and present results on 4 x 4 and 6 x 6 lattices. Generalization to physical hopping parameter values wig only require use of an improved trial wavefunction for importance sampling
Dols, V. J.; Delamere, P. A.; Bagenal, F.; Cassidy, T. A.; Crary, F. J.
2014-12-01
We model the interaction of Europa's tenuous atmosphere with the plasma of Jupiter's torus with an improved version of our hybrid plasma code. In a hybrid plasma code, the ions are treated as kinetic Macro-particles moving under the Lorentz force and the electrons as a fluid leading to a generalized formulation of Ohm's law. In this version, the spatial simulation domain is decomposed in 2 directions and is non-uniform in the plasma convection direction. The code is run on a multi-processor supercomputer that offers 16416 cores and 2GB Ram per core. This new version allows us to tap into the large memory of the supercomputer and simulate the full interaction volume (Reuropa=1561km) with a high spatial resolution (50km). Compared to Io, Europa's atmosphere is about 100 times more tenuous, the ambient magnetic field is weaker and the density of incident plasma is lower. Consequently, the electrodynamic interaction is also weaker and substantial fluxes of thermal torus ions might reach and sputter the icy surface. Molecular O2 is the dominant atmospheric product of this surface sputtering. Observations of oxygen UV emissions (specifically the ratio of OI 1356A / 1304A emissions) are roughly consistent with an atmosphere that is composed predominantely of O2 with a small amount of atomic O. Galileo observations along flybys close to Europa have revealed the existence of induced currents in a conducting ocean under the icy crust. They also showed that, from flyby to flyby, the plasma interaction is very variable. Asymmetries of the plasma density and temperature in the wake of Europa were also observed and still elude a clear explanation. Galileo mag data also detected ion cyclotron waves, which is an indication of heavy ion pickup close to the moon. We prescribe an O2 atmosphere with a vertical density column consistent with UV observations and model the plasma properties along several Galileo flybys of the moon. We compare our results with the magnetometer
J. M. Vali Samani
2016-02-01
Full Text Available Introduction: The greatest part of constructed dams belongs to embankment dams and there are many examples of their failures throughout history. About one-third of the world’s dam failures have been caused by flood overtopping, which indicates that flood overtopping is an important factor affecting reservoir projects’ safety. Moreover, because of a poor understanding of the randomness of floods, reservoir water levels during flood seasons are often lowered artificially in order to avoid overtopping and protect the lives and property of downstream residents. So, estimation of dam overtopping risk with regard to uncertainties is more important than achieving the dam’s safety. This study presents the procedure for risk evaluation of dam overtopping due to various uncertaintiess in inﬂows and reservoir initial condition. Materials and Methods: This study aims to present a practical approach and compare the different uncertainty analysis methods in the evaluation of dam overtopping risk due to flood. For this purpose, Monte Carlo simulation and Latin hypercube sampling methods were used to calculate the overtopping risk, evaluate the uncertainty, and calculate the highest water level during different flood events. To assess these methods from a practical point of view, the Maroon dam was chosen for the case study. Figure. 1 indicates the work procedure, including three parts: 1 Identification and evaluation of effective factors on flood routing and dam overtopping, 2 Data collection and analysis for reservoir routing and uncertainty analysis, 3 Uncertainty and risk analysis. Figure 1- Diagram of dam overtopping risk evaluation Results and Discussion: Figure 2 shows the results of the computed overtopping risks for the Maroon Dam without considering the wind effect, for the initial water level of 504 m as an example. As it is shown in Figure. 2, the trends of the risk curves computed by the different uncertainty analysis methods are similar
Zou, Liang; Fu, Zhuang; Zhao, YanZheng; Yang, JunYan
2010-07-01
This paper proposes a kind of pipelined electric circuit architecture implemented in FPGA, a very large scale integrated circuit (VLSI), which efficiently deals with the real time non-uniformity correction (NUC) algorithm for infrared focal plane arrays (IRFPA). Dual Nios II soft-core processors and a DSP with a 64+ core together constitute this image system. Each processor undertakes own systematic task, coordinating its work with each other's. The system on programmable chip (SOPC) in FPGA works steadily under the global clock frequency of 96Mhz. Adequate time allowance makes FPGA perform NUC image pre-processing algorithm with ease, which has offered favorable guarantee for the work of post image processing in DSP. And at the meantime, this paper presents a hardware (HW) and software (SW) co-design in FPGA. Thus, this systematic architecture yields an image processing system with multiprocessor, and a smart solution to the satisfaction with the performance of the system.
Singh, R. R. P.; Young, A. P.
2017-08-01
We study the ±J transverse-field Ising spin-glass model at zero temperature on d -dimensional hypercubic lattices and in the Sherrington-Kirkpatrick (SK) model, by series expansions around the strong-field limit. In the SK model and in high dimensions our calculated critical properties are in excellent agreement with the exact mean-field results, surprisingly even down to dimension d =6 , which is below the upper critical dimension of d =8 . In contrast, at lower dimensions we find a rich singular behavior consisting of critical and Griffiths-McCoy singularities. The divergence of the equal-time structure factor allows us to locate the critical coupling where the correlation length diverges, implying the onset of a thermodynamic phase transition. We find that the spin-glass susceptibility as well as various power moments of the local susceptibility become singular in the paramagnetic phase before the critical point. Griffiths-McCoy singularities are very strong in two dimensions but decrease rapidly as the dimension increases. We present evidence that high enough powers of the local susceptibility may become singular at the pure-system critical point.
A Monte Carlo study of the ''minus sign problem'' in the t-J model using an intel IPSC/860 hypercube
Kovarik, M.D.; Barnes, T.; Tennessee Univ., Knoxville, TN
1993-01-01
We describe a Monte Carlo simulation of the 2-dimensional t-J model on an Intel iPSC/860 hypercube. The problem studied is the determination of the dispersion relation of a dynamical hole in the t-J model of the high temperature superconductors. Since this problem involves the motion of many fermions in more than one spatial dimensions, it is representative of the class of systems that suffer from the ''minus sign problem'' of dynamical fermions which has made Monte Carlo simulation very difficult. We demonstrate that for small values of the hole hopping parameter one can extract the entire hole dispersion relation using the GRW Monte Carlo algorithm, which is a simulation of the Euclidean time Schroedinger equation, and present results on 4 x 4 and 6 x 6 lattices. We demonstrate that a qualitative picture at higher hopping parameters may be found by extrapolating weak hopping results where the minus sign problem is less severe. Generalization to physical hopping parameter values will only require use of an improved trial wavefunction for importance sampling
Wonil Choi
2009-01-01
Full Text Available For mobile multiprocessor applications, achieving high performance with low energy consumption is a challenging task. In order to help programmers to meet these design requirements, system development tools play an important role. In this paper, we describe one such development tool, ePRO-MP, which profiles and optimizes both performance and energy consumption of multi-threaded applications running on top of Linux for ARM11 MPCore-based embedded systems. One of the key features of ePRO-MP is that it can accurately estimate the energy consumption of multi-threaded applications without requiring a power measurement equipment, using a regression-based energy model. We also describe another key benefit of ePRO-MP, an automatic optimization function, using two example problems. Using the automatic optimization function, ePRO-MP can achieve high performance and low power consumption without programmer intervention. Our experimental results show that ePRO-MP can improve the performance and energy consumption by 6.1% and 4.1%, respectively, over a baseline version for the co-running applications optimization example. For the producer-consumer application optimization example, ePRO-MP improves the performance and energy consumption by 60.5% and 43.3%, respectively over a baseline version.
Albino, Jean Carlo de Campos
1994-11-01
The basic hypercube queuing model, as well its approximated version, are normally used to analyze public emergency systems`performance. Both models, however, can be used only when the system has no geographical limitation of its Emergency Mobile Units (EMU), i.e, in systems in which the region`s atoms can be attended by any EMU, meaning that each EMU can over the entirety of the analysed region. In systems that does not present this property, one could only use simulation techniques. This paper presents an adaptation of the basic hypercube model that was developed for cases of urban emergency systems in which the EMU`s are not allowed or not worthwhile to cover the entire region. This adapted version is called Limited Hypercube Model (LHM). Modifications have been made in several steps of the basic model, and has brought a new analysis perspective, by amplifying the hypercube`s application universe and allowing mathematical description of some limited systems` specific features. The LHM was applied to analyze an electric energy distribution public service at Florianopolis, and the results are discussed in the paper. (author) 12 refs., 11 figs., 14 tabs.
Levitin, Gregory; Dai Yuanshun; Xie Min; Leng Poh, Kim
2003-01-01
In this paper we consider vulnerable systems which can have different states corresponding to different combinations of available elements composing the system. Each state can be characterized by a performance rate, which is the quantitative measure of a system's ability to perform its task. Both the impact of external factors (stress) and internal causes (failures) affect system survivability, which is determined as probability of meeting a given demand. In order to increase the survivability of the system, a multi-level protection is applied to its subsystems. This means that a subsystem and its inner level of protection are in their turn protected by the protection of an outer level. This double-protected subsystem has its outer protection and so forth. In such systems, the protected subsystems can be destroyed only if all of the levels of their protection are destroyed. Each level of protection can be destroyed only if all of the outer levels of protection are destroyed. We formulate the problem of finding the structure of series-parallel multi-state system (including choice of system elements, choice of structure of multi-level protection and choice of protection methods) in order to achieve a desired level of system survivability by the minimal cost. An algorithm based on the universal generating function method is used for determination of the system survivability. A multi-processor version of genetic algorithm is used as optimization tool in order to solve the structure optimization problem. An application example is presented to illustrate the procedure presented in this paper
Hypercube Solutions for Conjugate Directions
1991-12-01
rate Mb Bt C LinkO Linki Link2 Link3 3 57 ; 0 T414b-16 0.09 0 [ HOST 1:1 2:1 3:2 3 58 ; 1 T800c-20 0.80 1 [ 4:3 0:1 5:1 6:0) 59 ; 2 T2 -17 0.49 1 [ C004...LINK CONNECTIONS 114 ; NODE LOADABLE COMES 115 ; ID CODE (.tld) FROM: LIWKO LINKI LINK2 LIMK3 116 = 117 1, rootcode, rO, 0, 2, , 10; B004 118 2... LINKI then node k 559 * knows to receive messages from node n via LINKII. The element 560 * ic[CUBESIZE holds the channel for the root node (if
Fernando César Mendonça
2000-04-01
Full Text Available Em um sistema médico-emergencial, o nível de serviço é fundamental, pois a qualidade de atendimento poderá ser a diferença entre a vida e a morte do usuário. Além da qualidade do atendimento, um dos componentes mais importantes do nível de serviço é o tempo médio de resposta a uma chamada emergencial. Entretanto, adquirir equipamentos e/ou treinamentos para a equipe, para reduzir este tempo de resposta, incorre em custos e investimentos adicionais. Nesta situação, aparece um importante trade-off entre o nível de serviço e o investimento no sistema. Para auxiliar a análise deste trade-off, há um modelo conhecido da literatura chamado hipercubo, que é baseado em teoria de filas espacialmente distribuídas. Neste artigo, adaptamos e aplicamos o modelo hipercubo para analisar os Anjos do Asfalto, um sistema médico-emergencial localizado ao longo da Rodovia Presidente Dutra entre as cidades do Rio de Janeiro e São Paulo. Os resultados mostraram que o modelo hipercubo pode ser uma ferramenta efetiva de apoio às decisões estratégicas e operacionais em sistemas emergenciais em rodovias.In an emergency medical system, the quality of the service is fundamental since this might be the difference between life and death of the user. Beyond the attendance quality, one of the most important components of the service level is the mean response time to an emergency call. However, acquiring equipment and training for the team to reduce this time imply additional costs and investments. In this case, an important trade-off appears between the service level and the investment in the system. To analyze this trade-off, there is a well-known model named hypercube, which is based in spatially distributed queueing theory. In this paper, we adapt and apply the hypercube model to analyze the "Anjos do Asfalto", an emergency medical system on the highway Presidente Dutra between the cities of Rio de Janeiro and São Paulo, in Brazil. The results
Fernando Chiyoshi
2000-08-01
Full Text Available O modelo hipercubo é revisitado tendo em vista sua utilização em métodos de solução para problemas de localização probabilísticos. Este uso do modelo é de bastante relevância em situações em que a aleatoriedade na disponibilidade dos servidores é um fator importante a ser considerado; em algumas circunstâncias esta aleatoriedade só pode ser representada pela modelagem de filas espacialmente distribuídas.O modelo é apresentado com o auxílio de um exemplo ilustrativo, para o qual são derivadas as equações de equilíbrio; medidas de desempenho do modelo são também definidas. Isto é seguido pela descrição de um método exato e de outro aproximado para o cálculo destas medidas. Diversos modelos de localização probabilísticos são então estudados, o que é seguido pela análise de métodos de solução disponíveis para esses modelos, com ênfase especial nos métodos que incluem o uso do modelo hipercubo. Embora atualmente de uso incipiente em problemas de localização probabilísticos, o modelo tem grande potencial nesse contexto, por exemplo se integrado a metaheurísticas tais como simulated annealing e busca tabu.The hypercube model is revisited regarding its use in solution methods for probabilistic location problems. This use of the model is relevant in situations in which the randomness in the availability of servers is an important factor to be considered; in some circumstances this randomness can be represented by spatially distributed queues. The model is presented through an illustrative example, for which the equilibrium equations are derived; some measures of performance are also defined. This is followed by the description of an exact and an approximate method for the calculation of these measures. Several probabilistic location models are then studied, which is followed by the analysis of solution methods for these models, with special emphasis given to methods that embed the hypercube model. Although
Ana Paula Iannoni
2006-04-01
Full Text Available O modelo hipercubo, conhecido na literatura de problemas de localização de sistemas servidor para cliente, é um modelo baseado em teoria de filas espacialmente distribuídas e aproximações Markovianas. O modelo pode ser modificado para analisar os sistemas de atendimentos emergenciais (SAEs em rodovias, considerando as particularidades da política de despacho destes sistemas. Neste estudo, combinou-se o modelo hipercubo com um algoritmo genético para otimizar a configuração e operação de SAEs em rodovias. A abordagem é efetiva para apoiar decisões relacionadas ao planejamento e operação destes sistemas, por exemplo, em determinar o tamanho ideal para as áreas de cobertura de cada ambulância, de forma a minimizar o tempo médio de resposta aos usuários e o desbalanceamento das cargas de trabalho das ambulâncias. Os resultados computacionais desta abordagem foram analisados utilizando dados reais do sistema Anjos do Asfalto (rodovia Presidente Dutra.The hypercube model, well-known in the literature on problems of server-to-customer localization systems, is based on the spatially distributed queuing theory and Markovian analysis approximations. The model can be modified to analyze Emergency Medical Systems (EMSs on highways, considering the particularities of these systems' dispatching policies. In this study, we combine the hypercube model with a genetic algorithm to optimize the configuration and operation of EMSs on highways. This approach is effective to support planning and operation decisions, such as determining the ideal size of the area each ambulance should cover to minimize not only the average time of response to the user but also ambulance workload imbalances, as well as generating a Pareto efficient boundary between these measures. The computational results of this approach were analyzed using real data Anjos do Asfalto EMS (which covers the Presidente Dutra highway.
Sergio Briguglio
2003-01-01
Full Text Available A performance-prediction model is presented, which describes different hierarchical workload decomposition strategies for particle in cell (PIC codes on Clusters of Symmetric MultiProcessors. The devised workload decomposition is hierarchically structured: a higher-level decomposition among the computational nodes, and a lower-level one among the processors of each computational node. Several decomposition strategies are evaluated by means of the prediction model, with respect to the memory occupancy, the parallelization efficiency and the required programming effort. Such strategies have been implemented by integrating the high-level languages High Performance Fortran (at the inter-node stage and OpenMP (at the intra-node one. The details of these implementations are presented, and the experimental values of parallelization efficiency are compared with the predicted results.
Gara, Alan; Ohmacht, Martin
2014-09-16
In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write though, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation blind first level cache.
Langenbuch, S.; Schmidt, K.D.; Velkov, K.
2003-01-01
The OECD/NRC BWR Turbine Trip (TT) Benchmark is investigated to perform code-to-code comparison of coupled codes including a comparison to measured data which are available from turbine trip experiments at Peach Bottom 2. This Benchmark problem for a BWR over-pressure transient represents a challenging application of coupled codes which integrate 3-dimensional neutron kinetics into thermal-hydraulic system codes for best-estimate simulation of plant transients. This transient represents a typical application of coupled codes which are usually performed on powerful workstations using a single CPU. Nowadays, the availability of multi-CPUs is much easier. Indeed, powerful workstations already provide 4 to 8 CPU, computer centers give access to multi-processor systems with numbers of CPUs in the order of 16 up to several 100. Therefore, the performance of the coupled code Athlet-Quabox/Cubbox on multi-processor systems is studied. Different cases of application lead to changing requirements of the code efficiency, because the amount of computer time spent in different parts of the code is varying. This paper presents main results of the coupled code Athlet-Quabox/Cubbox for the extreme scenarios of the BWR TT Benchmark together with evaluations of the code performance on multi-processor computers. (authors)
Kordy, M.; Wannamaker, P.; Maris, V.; Cherkaev, E.; Hill, G.
2016-01-01
Following the creation described in Part I of a deformable edge finite-element simulator for 3-D magnetotelluric (MT) responses using direct solvers, in Part II we develop an algorithm named HexMT for 3-D regularized inversion of MT data including topography. Direct solvers parallelized on large-RAM, symmetric multiprocessor (SMP) workstations are used also for the Gauss-Newton model update. By exploiting the data-space approach, the computational cost of the model update becomes much less in both time and computer memory than the cost of the forward simulation. In order to regularize using the second norm of the gradient, we factor the matrix related to the regularization term and apply its inverse to the Jacobian, which is done using the MKL PARDISO library. For dense matrix multiplication and factorization related to the model update, we use the PLASMA library which shows very good scalability across processor cores. A synthetic test inversion using a simple hill model shows that including topography can be important; in this case depression of the electric field by the hill can cause false conductors at depth or mask the presence of resistive structure. With a simple model of two buried bricks, a uniform spatial weighting for the norm of model smoothing recovered more accurate locations for the tomographic images compared to weightings which were a function of parameter Jacobians. We implement joint inversion for static distortion matrices tested using the Dublin secret model 2, for which we are able to reduce nRMS to ˜1.1 while avoiding oscillatory convergence. Finally we test the code on field data by inverting full impedance and tipper MT responses collected around Mount St Helens in the Cascade volcanic chain. Among several prominent structures, the north-south trending, eruption-controlling shear zone is clearly imaged in the inversion.
Feltrin, Antonio Padilha
1991-05-01
This thesis describes a methodology for decomposing the repeat solution process the equation Ax = b independent tasks to be done in parallel, based on the matrix inverse factors (W-matrix) with partitions. The partitioning scheme proposed in this thesis consists of breaking up the W-matrix according the depths of the factorization path tree. In this scheme, all the information needed to generate the partitions can be obtained straightforward from the network factorization path tree. The partitioning algorithm is simple and ease to implement. The elements of W, matrices, except the last partition, can be obtained directly from L-matrix elements, not requiring extra work. The proposed scheme guarantees that additional fills be only created in the last partition. The forward and backward solutions are performed by rows, and the strategy proposed is to schedule on beach processor the operations corresponding to a row of each partition. It should be kept in mind that the multiprocessor environments are equipped with powerful unit processor and then it seems a sound strategy to perform the multi-add elementary operations inside the hardware in order to exploit its computing efficiency. This strategy seeks to match the parallel algorithm to the parallel architecture. The precedent relations - that give rise to delays - are replaced by multi-add operations performed inside the processor mode without external communication. In the partition, the forward, diagonal and backward solutions may be gathered, and so all the operations can be expressed as the product of matrix W{sub lp} by the update components of vector b. The performance results show that the potential speedup of the solution time is essentially bounded by the floating point operation capability of each processor, denoting that the methodology is a suitable way to exploit the growing power of the computing technology. 46 figs, 40 refs, 49 tabs.
Regiane Máximo de Souza
2013-01-01
Full Text Available Em alguns sistemas de atendimento médico emergencial, a demanda pelo serviço pode ser alta devido ao atendimento a pacientes em diferentes estados, desde mais graves até mais leves. Nesses sistemas, pode haver formação de filas de usuários aguardando atendimento, e a necessidade de se considerar explicitamente políticas de prioridade nesse atendimento torna-se importante. Neste trabalho propõe-se uma extensão do clássico modelo hipercubo de filas espacialmente distribuídas para considerar fila com prioridade. Para verificar a viabilidade e a aplicabilidade dessa abordagem, utilizam-se dados de um estudo de caso realizado no SAMU de Ribeirão Preto-SP. Foram analisados dois cenários que consideram dois aspectos relevantes: o impacto dos atendimentos de remoção de pacientes e o aumento da demanda nas diversas classes de chamados dos usuários do sistema. O foco é no tempo médio de resposta aos chamados dos usuários, considerado como uma medida de desempenho importante do sistema, principalmente aos chamados de classes com alta prioridade. Os resultados mostram que a abordagem pode ser utilizada para analisar satisfatoriamente sistemas com prioridade de fila.In some emergency medical systems the service demand is high due to the treatment of patients in the range severe to mild. In these systems, may be queues formation and so the need to explicitly consider priority in care is extremely important. In this study we extend the hypercube model to explicitly consider priority queue. In order to verify the feasibility and applicability of this approach, we conducted a case study at Ribeirão Preto´s SAMU (SAMU-RP. We analyzed two alternative scenarios to examine two important issues: the impact of the removals and the effect of increased demand in the different classes of calls of the system. The focus is on the average response time to users, considered as an important performance measure of the system, especially for the high
Parallel-Processing Test Bed For Simulation Software
Blech, Richard; Cole, Gary; Townsend, Scott
1996-01-01
Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
High-performance computing — an overview
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Hypercube Expert System Shell - Applying Production Parallelism.
1989-12-01
possible processor organizations, or int( rconntction n thod,, for par- allel architetures . The following are examples of commonlv used interconnection...this timing analysis because match speed-up avaiiah& from production parallelism is proportional to the average number of affected produclions1 ( 11:5
Lattice guage theories on a hypercube computer
Otto, S.W.
1984-01-01
A report on the parallel computer effort underway at Caltech and the use of these machines for lattice gauge theories is given. The computational requirements of the Monte Carlos are, of course, enormous, so high Mflops (Million floating point operations per second) and large memories are required. Various calculations on the machines in regards to their programmability (a non-trivial issue on a parallel computer) and their efficiency in usage of the machine are discussed
VLSI Based Multiprocessor Communications Networks.
1982-09-01
Networks". The contract began on September 1,1980 and was approved on scientific /technical grounds for a duration of three years. Incremental funding was...values for the individual delays will vary from comunicating modules (ij) are shown in Figure 4 module to module due to processing and fabrication
Filin Yuriy Nikolaevich
2013-05-01
Full Text Available The authors state the main aspects of innovative construction of form graphics of the structural geometric model of the Informative Hypercube produced using the universal Protocube-Designer method. The process of construction of the Info-Hypercube model is based on a well-known geometric transformation (motion technique, i.e. spatial displacement of two components of the initial cubic model represented by a pair of trihedrons, which complies with a well-known phenomenon of geometric componenthood (PGC. The use of the Protocube-Designer is used to construct the Info-Hypercube model form graphics stage by stage, on the basis of which the internal structure of the Info-Hypercube is formed. The method of Protocube-Designer makes it possible to reduce the cube model into a plane octagonal structure for the production of the pr-matrix lattice. Then, the required transformation of the plane structure with a pr-matrix into a spatial cubic model is performed. Thus, the aforesaid pratrix is used here as a structural unit for the production of a form graphics space lattice. The final fill-in of the whole internal structure of the Info-Hypercube is performed through completion of six additional intersecting planes (plates passing through the central Infocube in accordance with a typical one-side form graphics obtained earlier. The total number of planes in the internal Info-Hypercube structure will be equal to 18 (6 planes restricting the cubic form and 12 planes intersecting inside the model. As a result of this visual graphic construction, a rational formalized geometric Info-Hypercube model is obtained. This model represents an informative form graphics structure. The model is used in different fields, including constructive geometry, shaping of structural design elements as well as design of modern buildings and engineering structures.Изложены основные аспекты инновационного построения фо
Pthreads vs MPI Parallel Performance of Angular-Domain Decomposed S
Azmy, Y.Y.; Barnett, D.A.
2000-01-01
Two programming models for parallelizing the Angular Domain Decomposition (ADD) of the discrete ordinates (S n ) approximation of the neutron transport equation are examined. These are the shared memory model based on the POSIX threads (Pthreads) standard, and the message passing model based on the Message Passing Interface (MPI) standard. These standard libraries are available on most multiprocessor platforms thus making the resulting parallel codes widely portable. The question is: on a fixed platform, and for a particular code solving a given test problem, which of the two programming models delivers better parallel performance? Such comparison is possible on Symmetric Multi-Processors (SMP) architectures in which several CPUs physically share a common memory, and in addition are capable of emulating message passing functionality. Implementation of the two-dimensional,(S n ), Arbitrarily High Order Transport (AHOT) code for solving neutron transport problems using these two parallelization models is described. Measured parallel performance of each model on the COMPAQ AlphaServer 8400 and the SGI Origin 2000 platforms is described, and comparison of the observed speedup for the two programming models is reported. For the case presented in this paper it appears that the MPI implementation scales better than the Pthreads implementation on both platforms
A parallel row-based algorithm for standard cell placement with integrated error control
Sargent, Jeff S.; Banerjee, Prith
1989-01-01
A new row-based parallel algorithm for standard-cell placement targeted for execution on a hypercube multiprocessor is presented. Key features of this implementation include a dynamic simulated-annealing schedule, row-partitioning of the VLSI chip image, and two novel approaches to control error in parallel cell-placement algorithms: (1) Heuristic Cell-Coloring; (2) Adaptive Sequence Length Control.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Wilson Mauricio Chicaiza
2013-11-01
Full Text Available En el presente documento se presenta una breve caracterización de los medios de comunicación empleados en arquitecturas multiprocesadas. Esta caracterización tiene como objetivo principal el mostrar un nuevo modelo de comunicación basado en conmutación de paquetes a los cuales se les denomina como Networks-On-Chip (NoC. Esta publicación muestra una arquitectura de red llamada NoC Hermes, la cual fue interconectada a un Multiprocessor-Systems-on-Chip (MPSoC compuesto de cuatro procesadores MicroBlaze. Está conexión se la realizó gracias al diseño y desarrollo de una Interfaz de Red generada en código VHDL. Por medio de la Interfaz de Red se consiguió que los procesadores MicroBlaze interactúen con los Switches de Hermes a fin de crear una arquitectura multiprocesada interconectada por una NoC. Con el motivo de realizar comparaciones también se creó otra arquitectura de multiprocesadores interconectados por buses. Para ambas arquitecturas se desarrolló una aplicación de Esteganografía enla que existe multiprocesamiento de dos procesadores trabajando simultáneamente. Lamentablemente sobre dicha aplicación no fue posible medir directamente la latencia y el consumo de energía, razón por la cual se utilizó simuladores que permitieron estimar dichas mediciones.
Automatic code generation for distributed robotic systems
Jones, J.P.
1993-01-01
Hetero Helix is a software environment which supports relatively large robotic system development projects. The environment supports a heterogeneous set of message-passing LAN-connected common-bus multiprocessors, but the programming model seen by software developers is a simple shared memory. The conceptual simplicity of shared memory makes it an extremely attractive programming model, especially in large projects where coordinating a large number of people can itself become a significant source of complexity. We present results from three system development efforts conducted at Oak Ridge National Laboratory over the past several years. Each of these efforts used automatic software generation to create 10 to 20 percent of the system
A heterogeneous multi-core platform for low power signal processing in systems-on-chip
Paker, Ozgun; Sparsø, Jens; Haandbæk, Niels
2002-01-01
is based on message passing. The mini-cores are designed as parameterized soft macros intended for a synthesis based design flow. A 520.000 transistor 0.25µm CMOS prototype chip containing 6 mini-cores has been fabricated and tested. Its power consumption is only 50% higher than a hardwired ASIC and more......This paper presents a low-power and programmable DSP architecture - a heterogeneous multiprocessor platform consisting of standard CPU/DSP cores, and a set of simple instruction set processors called mini-cores each optimized for a particular class of algorithm (FIR, IIR, LMS, etc.). Communication...
Application Portable Parallel Library
Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott
1995-01-01
Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.
The fast Amsterdam multiprocessor (FAMP) operation system
Gosman, D.; Hertzberger, L.O.; Holthuizen, D.J.; Por, G.J.A.; Schoorel, M.
1981-01-01
The Fast Amsterdam Multi Processor system (FAMP system) is developed for on-line filtering and second stage triggering. The system is based on the MC 68000 microprocessor from MOTOROLA. In this report we will describe: The FAMP operating system software, the features of the slaves and supervisor in the FAMP operating system, the communication between supervisor and slaves using the dual port memories, the communication between user programs and the operating system. The hardware as well as the application of the system will be described elsewhere. (orig.)
External parallel sorting with multiprocessor computers
Comanceau, S.I.
1984-01-01
This article describes methods of external sorting in which the entire main computer memory is used for the internal sorting of entries, forming out of them sorted segments of the greatest possible size, and outputting them to external memories. The obtained segments are merged into larger segments until all entries form one ordered segment. The described methods are suitable for sequential files stored on magnetic tape. The needs of the sorting algorithm can be met by using the relatively slow peripheral storage devices (e.g., tapes, disks, drums). The efficiency of the external sorting methods is determined by calculating the total sorting time as a function of the number of entries to be sorted and the number of parallel processors participating in the sorting process
Communications Patterns in a Symbolic Multiprocessor.
1987-06-01
instruction references that Multilisp programs make. The cache hit ratio is greatest when instruction references have a high degree of -- locality. Another...future touches hit an undetermined future. N, The only exception is Consim, in which one third of future touches hit unde- termined futures. Task...Cambridge, MA, June 1985. [52] S. Sugimoto, K. Agusa, K. Tabata , and Y. Ohno. A multi-microprocessor system for concurrent Lisp. In Proceedings of
Decay-usage scheduling in multiprocessors
Epema, D.H.J.
1998-01-01
Decay-usage scheduling is a priority-aging time-sharing scheduling policy capable of dealing with a workload of both interactive and batch jobs by decreasing the priority of a job when it acquires CPU time, and by increasing its priority when it does not use the (a) CPU. In this article we deal with
Architecture of an acquisition system-multiprocessors
Postec, H.
1987-07-01
To follow the huge increasing of concerned parameters in nuclear detection systems, acquisition systems become bigger and have to present very good rapidity performance. At Ganil, four detection systems have been set in Nautilus reaction chamber, that lead to experiment configurations with 700 parameters to process. In front of present acquisition system limitation, a device more relevant to lecture of a large number of channels show off necessary. Functionalities already operating in other systems and hardware already used have been chosen; specific technical solutions were aldo developed to use the most recent techniques and to take in account the four detection system structure of the device [fr
Efficient thermal management for multiprocessor systems
Coşkun, Ayşe Kıvılcım
2009-01-01
High temperatures and large thermal variations on the die create severe challenges in system reliability, performance, leakage power, and cooling costs. Designing for worst-case thermal conditions is highly costly and time-consuming. Therefore, dynamic thermal management methods are needed to maintain safe temperature levels during execution. Conventional management techniques sacrifice performance to control temperature and only consider the hot spots, neglecting the effects of thermal varia...
Portable Instrumented Communication Library
Geist, G.A.; Heath, M.T.; Peyton, B.W.; Worley, P.H.
2001-01-01
1 - Description of program or function: PICL is a subroutine library that can be used to develop parallel programs that are portable across several distributed-memory multi-processors. PICL provides a portable syntax for key communication primitives and related system calls. It also provides portable routines to perform certain widely- used, high-level communication operations, such as global broadcast and global summation. PICL provides execution tracing that can be used to monitor performance or to aid in debugging. 2 - Restrictions on the complexity of the problem: PICL is a compatibility library built on top of the native multiprocessor operating system and message passing primitives. Thus, the portability of PICL programs is not guaranteed, being a function of idiosyncrasies of the different platforms. Predictable differences are captured with standard error trapping routines. PICL is a research tool, not a production software system
Quealy, Angela; Cole, Gary L.; Blech, Richard A.
1993-01-01
The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
Linear algebra for dense matrices on a hypercube
Sears, M.P.
1990-01-01
A set of routines has been written for dense matrix operations optimized for the NCUBE/6400 parallel processor. This paper was motivated by a Sandia effort to parallelize certain electronic structure calculations. Routines are included for matrix transpose, multiply, Cholesky decomposition, triangular inversion, and Householder tridiagonalization. The library is written in C and is callable from Fortran. Matrices up to order 1600 can be handled on 128 processors. For each operation, the algorithm used is presented along with typical timings and estimates of performance. Performance for order 1600 on 128 processors varies from 42 MFLOPs (House-holder tridiagonalization, triangular inverse) up to 126 MFLOPs (matrix multiply). The authors also present performance results for communications and basic linear algebra operations (saxpy and dot products)
INTUITEL and the Hypercube Model - Developing Adaptive Learning Environments
Kevin Fuchs
2016-06-01
Full Text Available In this paper we introduce an approach for the creation of adaptive learning environments that give human-like recommendations to a learner in the form of a virtual tutor. We use ontologies defining pedagogical, didactic and learner-specific data describing a learner's progress, learning history, capabilities and the learner's current state within the learning environment. Learning recommendations are based on a reasoning process on these ontologies and can be provided in real-time. The ontologies may describe learning content from any domain of knowledge. Furthermore, we describe an approach to store learning histories as spatio-temporal trajectories and to correlate them with influencing didactic factors. We show how such analysis of spatiotemporal data can be used for learning analytics to improve future adaptive learning environments.
Parallel discrete ordinates algorithms on distributed and common memory systems
Wienke, B.R.; Hiromoto, R.E.; Brickner, R.G.
1987-01-01
The S/sub n/ algorithm employs iterative techniques in solving the linear Boltzmann equation. These methods, both ordered and chaotic, were compared on both the Denelcor HEP and the Intel hypercube. Strategies are linked to the organization and accessibility of memory (common memory versus distributed memory architectures), with common concern for acquisition of global information. Apart from this, the inherent parallelism of the algorithm maps directly onto the two architectures. Results comparing execution times, speedup, and efficiency are based on a representative 16-group (full upscatter and downscatter) sample problem. Calculations were performed on both the Los Alamos National Laboratory (LANL) Denelcor HEP and the LANL Intel hypercube. The Denelcor HEP is a 64-bit multi-instruction, multidate MIMD machine consisting of up to 16 process execution modules (PEMs), each capable of executing 64 processes concurrently. Each PEM can cooperate on a job, or run several unrelated jobs, and share a common global memory through a crossbar switch. The Intel hypercube, on the other hand, is a distributed memory system composed of 128 processing elements, each with its own local memory. Processing elements are connected in a nearest-neighbor hypercube configuration and sharing of data among processors requires execution of explicit message-passing constructs
Applications and development of communication models for the touchstone GAMMA and DELTA prototypes
Seidel, Steven R.
1993-01-01
The goal of this project was to develop models of the interconnection networks of the Intel iPSC/860 and DELTA multicomputers to guide the design of efficient algorithms for interprocessor communication in problems that commonly occur in CFD codes and other applications. Interprocessor communication costs of codes for message-passing architectures such as the iPSC/860 and DELTA significantly affect the level of performance that can be obtained from those machines. This project addressed several specific problems in the achievement of efficient communication on the Intel iPSC/860 hypercube and DELTA mesh. In particular, an efficient global processor synchronization algorithm was developed for the iPSC/860 and numerous broadcast algorithms were designed for the DELTA.
The guide to PAMIR theory and use of Parameterized Adaptive Multidimensional Integration Routines
Adler, Stephen L
2013-01-01
PAMIR (Parameterized Adaptive Multidimensional Integration Routines) is a suite of Fortran programs for multidimensional numerical integration over hypercubes, simplexes, and hyper-rectangles in general dimension p, intended for use by physicists, applied mathematicians, computer scientists, and engineers. The programs, which are available on the internet at website and are free for non-profit research use, are capable of following localized peaks and valleys of the integrand. Each program comes with a Message-Passing Interface (MPI) parallel version for cluster use as well as serial versions. The first chapter presents introductory material, similar to that on the PAMIR website, and the next is a "manual" giving much more detail on the use of the programs than is on the website. They are followed by many examples of performance benchmarks and comparisons with other programs, and a discussion of the computational integration aspects of PAMIR, in comparison with other methods in the literature. The final chapt...
Parallel ray tracing for one-dimensional discrete ordinate computations
Jarvis, R.D.; Nelson, P.
1996-01-01
The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)
Parallel community climate model: Description and user`s guide
Drake, J.B.; Flanery, R.E.; Semeraro, B.D.; Worley, P.H. [and others
1996-07-15
This report gives an overview of a parallel version of the NCAR Community Climate Model, CCM2, implemented for MIMD massively parallel computers using a message-passing programming paradigm. The parallel implementation was developed on an Intel iPSC/860 with 128 processors and on the Intel Delta with 512 processors, and the initial target platform for the production version of the code is the Intel Paragon with 2048 processors. Because the implementation uses a standard, portable message-passing libraries, the code has been easily ported to other multiprocessors supporting a message-passing programming paradigm. The parallelization strategy used is to decompose the problem domain into geographical patches and assign each processor the computation associated with a distinct subset of the patches. With this decomposition, the physics calculations involve only grid points and data local to a processor and are performed in parallel. Using parallel algorithms developed for the semi-Lagrangian transport, the fast Fourier transform and the Legendre transform, both physics and dynamics are computed in parallel with minimal data movement and modest change to the original CCM2 source code. Sequential or parallel history tapes are written and input files (in history tape format) are read sequentially by the parallel code to promote compatibility with production use of the model on other computer systems. A validation exercise has been performed with the parallel code and is detailed along with some performance numbers on the Intel Paragon and the IBM SP2. A discussion of reproducibility of results is included. A user`s guide for the PCCM2 version 2.1 on the various parallel machines completes the report. Procedures for compilation, setup and execution are given. A discussion of code internals is included for those who may wish to modify and use the program in their own research.
A parallel neural network training algorithm for control of discrete dynamical systems.
Gordillo, J. L.; Hanebutte, U. R.; Vitela, J. E.
1998-01-20
In this work we present a parallel neural network controller training code, that uses MPI, a portable message passing environment. A comprehensive performance analysis is reported which compares results of a performance model with actual measurements. The analysis is made for three different load assignment schemes: block distribution, strip mining and a sliding average bin packing (best-fit) algorithm. Such analysis is crucial since optimal load balance can not be achieved because the work load information is not available a priori. The speedup results obtained with the above schemes are compared with those corresponding to the bin packing load balance scheme with perfect load prediction based on a priori knowledge of the computing effort. Two multiprocessor platforms: a SGI/Cray Origin 2000 and a IBM SP have been utilized for this study. It is shown that for the best load balance scheme a parallel efficiency of over 50% for the entire computation is achieved by 17 processors of either parallel computers.
Parallelism at Cern: real-time and off-line applications in the GP-MIMD2 project
Calafiura, P.
1997-01-01
A wide range of general purpose high-energy physics applications, ranging from Monte Carlo simulation to data acquisition, from interactive data analysis to on-line filtering, have been ported, or developed, and run in parallel on IBM SP-2 and Meiko CS-2 CERN large multi-processor machines. The ESPRIT project GP-MIMD2 has been a catalyst for the interest in parallel computing at CERN. The project provided the 128 processor Meiko CS-2 system that is now succesfully integrated in the CERN computing environment. The CERN experiment NA48 was involved in the GP-MIMD2 project since the beginning. NA48 physicists run, as part of their day-to-day work, simulation and analysis programs parallelized using the message passing interface MPI. The CS-2 is also a vital component of the experiment data acquisition system and will be used to calibrate in real-time the 13000 channels liquid krypton calorimeter. (orig.)
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2011-01-11
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications
Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid
1999-01-01
The ability to dynamically adapt an unstructured -rid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the view point of portability on various multiprocessor platforms We address this problem by developing PLUM, tin automatic anti architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with a goal to providing a global view of system loads across processors. Experiments on, an SP2 and an Origin2000 demonstrate the portability of our approach which achieves superb load balance at the cost of minimal extra overhead.
DOLIB: Distributed Object Library
D' Azevedo, E.F.
1994-01-01
This report describes the use and implementation of DOLIB (Distributed Object Library), a library of routines that emulates global or virtual shared memory on Intel multiprocessor systems. Access to a distributed global array is through explicit calls to gather and scatter. Advantages of using DOLIB include: dynamic allocation and freeing of huge (gigabyte) distributed arrays, both C and FORTRAN callable interfaces, and the ability to mix shared-memory and message-passing programming models for ease of use and optimal performance. DOLIB is independent of language and compiler extensions and requires no special operating system support. DOLIB also supports automatic caching of read-only data for high performance. The virtual shared memory support provided in DOLIB is well suited for implementing Lagrangian particle tracking techniques. We have also used DOLIB to create DONIO (Distributed Object Network I/O Library), which obtains over a 10-fold improvement in disk I/O performance on the Intel Paragon.
DOLIB: Distributed Object Library
D`Azevedo, E.F.; Romine, C.H.
1994-10-01
This report describes the use and implementation of DOLIB (Distributed Object Library), a library of routines that emulates global or virtual shared memory on Intel multiprocessor systems. Access to a distributed global array is through explicit calls to gather and scatter. Advantages of using DOLIB include: dynamic allocation and freeing of huge (gigabyte) distributed arrays, both C and FORTRAN callable interfaces, and the ability to mix shared-memory and message-passing programming models for ease of use and optimal performance. DOLIB is independent of language and compiler extensions and requires no special operating system support. DOLIB also supports automatic caching of read-only data for high performance. The virtual shared memory support provided in DOLIB is well suited for implementing Lagrangian particle tracking techniques. We have also used DOLIB to create DONIO (Distributed Object Network I/O Library), which obtains over a 10-fold improvement in disk I/O performance on the Intel Paragon.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-02-21
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Apparatus for multiprocessor-based control of a multiagent robot
Peters, II, Richard Alan (Inventor)
2009-01-01
An architecture for robot intelligence enables a robot to learn new behaviors and create new behavior sequences autonomously and interact with a dynamically changing environment. Sensory information is mapped onto a Sensory Ego-Sphere (SES) that rapidly identifies important changes in the environment and functions much like short term memory. Behaviors are stored in a DBAM that creates an active map from the robot's current state to a goal state and functions much like long term memory. A dream state converts recent activities stored in the SES and creates or modifies behaviors in the DBAM.
Preconditioning Strategies for Solving Elliptic Difference Equations on a Multiprocessor.
1982-01-01
CALLING MA31C. C TD - TIME REQUIRED BY MA31C TO PERFORM THE C FACTORIZATION. 160 C TDT - TOTAL TIME REQUIRED BY SUBROUTINE FACTOR. C TDT-T IM3-T IMI...TPD-TIM2-TIM1 TD -TIM3-TIM2 165 C WRITE(LP,70) TDT,TPD, TD 70 FORMAT(7H TDT - ,F6.3,7H TPD " F6.3,6H TD ,F6.3) C WRITE(LP, 85) NTYPE,NVERN 170 85 FORMAT...INTEGER INI(IAI) ,INJ(IAJ) ,IK(NN,4) INTEGER NU(3000) C COMMON/EA 4BD /PRVT(4),IPRVT(6) 15 COMMON/MA31I/DD,LP,MP COMMON/MA31J/LROW,LCOL,NCP,ND, IPD
A multiprocessor system for experiments in nuclear physics
Neudold, M.
1986-12-01
In the first part of this report, a provisional version of the data acquisition system for the magnetic spectrometer LITTLE JOHN at the Karlsruhe isochronous cyclotron is described. The conceptual ideas of this first version and the problems appearing during the test of its hardware and software components are discussed. Basing on the experience with this system possible extensions and new busstructures are suggested. In the second part, some prototype programs are described, which have been developed in order to test the ADC-systems and the dataways. (orig.) [de
Fundamental Parallel Algorithms for Private-Cache Chip Multiprocessors
Arge, Lars Allan; Goodrich, Michael T.; Nelson, Michael
2008-01-01
about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present...... two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks....... In addition, we study sorting lower bounds in a computational model, which we call the parallel external-memory (PEM) model, that formalizes the essential properties of our algorithms for private-cache CMPs....
Highway traffic simulation on multi-processor computers
Hanebutte, U.R.; Doss, E.; Tentner, A.M.
1997-04-01
A computer model has been developed to simulate highway traffic for various degrees of automation with a high level of fidelity in regard to driver control and vehicle characteristics. The model simulates vehicle maneuvering in a multi-lane highway traffic system and allows for the use of Intelligent Transportation System (ITS) technologies such as an Automated Intelligent Cruise Control (AICC). The structure of the computer model facilitates the use of parallel computers for the highway traffic simulation, since domain decomposition techniques can be applied in a straight forward fashion. In this model, the highway system (i.e. a network of road links) is divided into multiple regions; each region is controlled by a separate link manager residing on an individual processor. A graphical user interface augments the computer model kv allowing for real-time interactive simulation control and interaction with each individual vehicle and road side infrastructure element on each link. Average speed and traffic volume data is collected at user-specified loop detector locations. Further, as a measure of safety the so- called Time To Collision (TTC) parameter is being recorded.
An Adaptive Hybrid Multiprocessor technique for bioinformatics sequence alignment
Bonny, Talal; Salama, Khaled N.; Zidan, Mohammed A.
2012-01-01
Sequence alignment algorithms such as the Smith-Waterman algorithm are among the most important applications in the development of bioinformatics. Sequence alignment algorithms must process large amounts of data which may take a long time. Here, we
Analysis, design and management of multimedia multiprocessor systems
Kumar, A.
2009-01-01
The design of multimedia platforms is becoming increasingly more complex. Modern multimedia systems need to support a large number of applications or functions in a single device. To achieve high performance in such systems, more and more processors are being integrated into a single chip to build
A multitasking, multisinked, multiprocessor data acquisition front end
Fox, R.; Au, R.; Molen, A.V.
1989-01-01
The authors have developed a generalized data acquisition front end system which is based on MC68020 processors running a commercial real time kernel (rhoSOS), and implemented primarily in a high level language (C). This system has been attached to the back end on-line computing system at NSCL via our high performance ETHERNET protocol. Data may be simultaneously sent to any number of back end systems. Fixed fraction sampling along links to back end computing is also supported. A nonprocedural program generator simplifies the development of experiment specific code
A multiprocessor system for the analysis of cardiac arrhythmias
Jugo, G. Diego J.
1983-01-01
The study presented here forms part of a real-time electrocardiographic signal analysis system project. The medical data analysis system consists of eight independent analyser units, all of which are connected to a controller unit by the standard IEEE-488 bus. Each bed is monitored by an arrhythmias analyser which has the facilities for analysing a patient's electrocardiogram automatically in real time. The measurements may be made in the presence of a pacemaker. The system described presents the technique of a separate and a simultaneous data analysis on two different ECG channels. This permit: a better diagnosis and an optimum use of the bus. (author) [fr
Lower Bounds and Semi On-line Multiprocessor Scheduling
T.C. Edwin Cheng
2003-10-01
Full Text Available We are given a set of identical machines and a sequence of jobs from which we know the sum of the job weights in advance. The jobs have to be assigned on-line to one of the machines and the objective is to minimize the makespan. An algorithm with performance ratio 1.6 and a lower bound of 1.5 is presented. This improves recent results by Azar and Regev who published an algorithm with performance ratio 1.625 for the less general problem that the optimal makespan is known in advance.
Operating system considerations in the multiprocessor MIDAS environment
Weaver, D.; Maples, C.; Meng, J.; Rathbun, W.
1983-01-01
The operating system for MIDAS provides interfaces for custom hardware, debugging facilities, and run time support. The MIDAS architecture uses various specialized hardware devices for controlling the multiple processors and to achieve high I/O throughput. The operating system interfaces with the custom hardware for diagnostics, problem setup, and loading the processors with user code. After the code is loaded, a debugging facility may be used to examine or modify the program in any of the processors, or all the processors simultaneously. During execution of the code, the operating system monitors the processors for exceptional conditions, detects hardware failures, and gathers statistics on performance. This performance information includes histograms depicting instruction execution frequency and analysis of data flow
A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories
1989-02-01
frames per second, font generation directly from conic spline descriptions, and rapid calculation of radiosity form factors. The hardware consists of...generality for rendering curved surfaces, volume data, objects dcscri id with Constructive Solid Geometry, for rendering scenes using the radiosity ...f.aces and for computing a spherical radiosity lighting model (see Section 7.6). Custom Memory Chips \\ 208 bits x 128 pixels - Renderer Board ix p o a
Eclipse : Flexible Media Processing in a Heterogeneous Multiprocessor Template
Rutten, Martijn
2007-01-01
Multimedia-apparatuur, zoals DVD spelers, digitale televisies en moderne autoradio's bevatten specialistische chips die een groot deel van de functionaliteit verzorgen. De toepassingen van deze chips, zoals digitale audio en video, volgen elkaar in snel tempo op. De complexiteit van multimedia chips
An accurate analysis for guaranteed performance of multiprocessor streaming applications
Poplavko, P.
2008-01-01
Already for more than a decade, consumer electronic devices have been available for entertainment, educational, or telecommunication tasks based on multimedia streaming applications, i.e., applications that process streams of audio and video samples in digital form. Multimedia capabilities are
Models and formal verification of multiprocessor system-on-chips
Brekling, Aske Wiid; Hansen, Michael Reichhardt; Madsen, Jan
2008-01-01
present experimental results on rather small systems with high complexity, primarily due to differences between best-case and worst-case execution times. Considering worst-case execution times only, the system becomes deterministic and using a special version of {Uppaal}, where the no history is saved, we...
Static and dynamic task mapping onto network on chip multiprocessors
Freddy Bolaños-Martínez
2014-01-01
Full Text Available Las redes en circuito integrado (NoC representan un importante paradigma de uso creciente para los sistemas multiprocesador en circuito integrado (MPSoC, debido a su flexibilidad y escalabilidad. Las estrategias de tolerancia a fallos han venido adquiriendo importancia, a medida que los procesos de manufactura incursionan en dimensiones por debajo del micrómetro y la complejidad de los diseños aumenta. Este artículo describe un algoritmo de aprendizaje incremental basado en población (PBIL, orientado a optimizar el proceso de mapeo en tiempo de diseño, así como a encontrar soluciones de mapeo óptimas en tiempo de ejecución, para hacer frente a fallos de único nodo en la red. En ambos casos, los objetivos de optimización corresponden al tiempo de ejecución de las aplicaciones y al ancho de banda pico que aparece en la red. Las simulaciones se basaron en un algoritmo de ruteo XY determinístico, operando sobre una topología de malla 2D para la NoC. Los resultados obtenidos son prometedores. El algoritmo propuesto exhibe un desempeño superior a otras técnicas reportadas cuando el tamaño del problema aumenta.
Dynamic scheduling and analysis of real time systems with multiprocessors
M.D. Nashid Anjum
2016-08-01
Full Text Available This research work considers a scenario of cloud computing job-shop scheduling problems. We consider m realtime jobs with various lengths and n machines with different computational speeds and costs. Each job has a deadline to be met, and the profit of processing a packet of a job differs from other jobs. Moreover, considered deadlines are either hard or soft and a penalty is applied if a deadline is missed where the penalty is considered as an exponential function of time. The scheduling problem has been formulated as a mixed integer non-linear programming problem whose objective is to maximize net-profit. The formulated problem is computationally hard and not solvable in deterministic polynomial time. This research work proposes an algorithm named the Tube-tap algorithm as a solution to this scheduling optimization problem. Extensive simulation shows that the proposed algorithm outperforms existing solutions in terms of maximizing net-profit and preserving deadlines.
Estimating Performance of Single Bus, Shared Memory Multiprocessors
1987-05-01
Chandy78] K.M. Chandy, C.M. Sauer, "Approximate methods for analyzing queuing network models of computing systems," Computing Surveys, vol10 , no 3...Denning78] P. Denning, J. Buzen, "The operational analysis of queueing network models", Computing Sur- veys, vol10 , no 3, September 1978, pp 225-261
DAPHNE: a parallel multiprocessor data acquisition system for nuclear physics
Welch, L.C.
1984-01-01
This paper describes a project to meet these data acquisition needs for a new accelerator, ATLAS, being built at Argonne National Laboratory. ATLAS is a heavy-ion linear superconducting accelerator providing beam energies up to 25 MeV/A with a relative spread in beam energy as good as .0001 and a time spread of less than 100 psec. Details about the hardware front end, command language, data structure, and the flow of event treatment are covered
Benchmarking GNU Radio Kernels and Multi-Processor Scheduling
2013-01-14
Linux, GNU Radio, and file system for target machine • Run benchmarking scripts with VOLK enabled • Run benchmarking scripts with VOLK disabled ...numpy.arange(1,11) pipes3d, stages3d = numpy.meshgrid(pipes, stages) # print results.size 86 # print pipes.size # print stages.size surf = s1.plot_surface...pipes3d, stages3d, results, rstride=1, cstride=1,cmap=cm.jet, linewidth=0, antialiased=True) 90 # f1.colorbar( surf , shrink=0.5, aspect=5) # s1.set_zlim3d
Advancing Software Development for a Multiprocessor System-on-Chip
Stephen Bique
2007-06-01
Full Text Available A low-level language is the right tool to develop applications for some embedded systems. Notwithstanding, a high-level language provides a proper environment to develop the programming tools. The target device is a system-on-chip consisting of an array of processors with only local communication. Applications include typical streaming applications for digital signal processing. We describe the hardware model and stress the advantages of a flexible device. We introduce IDEA, a graphical integrated development environment for an array. A proper foundation for software development is a UML and standard programming abstractions in object-oriented languages.
Numerical simulation of nuclear power plants on multiprocessor systems
Boeer, R.; Finnemann, H.; Mueller, R.; Krapf, J.
1990-04-01
The present final report gives an account of work performed and results obtained at Siemens/KWU within the frame of SUPRENUM project. The coupled neutron kinetics - thermal hydraulics program system 3D-SIM for the calculation of steady-state and transient conditions in the core and in the coolant loops of light water reactors has been developed. Making efficient use of the parallel computing architecture required the implementation of novel, purpose-built software concepts which can be characterized by multigrid algorithms and parallelization by domain decomposition. The modules treating the reactor core are 3D-NEUT for neutron kinetics and 3D-THERM for thermal hydraulics, the remaining system and control components being represented by the module NLOOP. This modular structure is a main feature of 3D-SIM which allows a flexible choice of applications, employing modules individually or in combinations to fulfil a specific calculation task. In particular, 3D-SIM reactor core representation is fully 3-dimensional for both neutron diffusion and thermal-hydraulic equations, allowing coupled calculations and evaluation of crossflow between fuel elements or subchannels. Including also the loops and plant models, 3D-SIM is a powerful analytical tool for simulating pressurized plant response to a wide spectrum of events. (orig.) With 25 refs., 11 figs [de
VIRTUS: a multi-processor system in FASTBUS
Ellett, J.; Jackson, R.; Ritter, R.; Schlein, P.; Yaeger, D.; Zweizig, J.
1986-01-01
VIRTUS is a system of parallel MC68000-based processors interconnected by FASTBUS that is used either on-line as an intelligent trigger component or off-line for full event processing. Each processor receives the complete set of data from one event. The host computer, a VAX 11/780, down-line loads all software to the processors, controls and monitors the functioning of all processors, and writes processed data to tape. Instructions, programs, and data are transferred among the processors and the host in the form of fixed format, variable length data blocks. (Auth.)
Massively parallel mathematical sieves
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
Kirk, B.L.; Azmy, Y.Y.
1992-01-01
In this paper the one-group, steady-state neutron diffusion equation in two-dimensional Cartesian geometry is solved using the nodal integral method. The discrete variable equations comprise loosely coupled sets of equations representing the nodal balance of neutrons, as well as neutron current continuity along rows or columns of computational cells. An iterative algorithm that is more suitable for solving large problems concurrently is derived based on the decomposition of the spatial domain and is accelerated using successive overrelaxation. This algorithm is very well suited for parallel computers, especially since the spatial domain decomposition occurs naturally, so that the number of iterations required for convergence does not depend on the number of processors participating in the calculation. Implementation of the authors' algorithm on the Intel iPSC/2 hypercube and Sequent Balance 8000 parallel computer is presented, and measured speedup and efficiency for test problems are reported. The results suggest that the efficiency of the hypercube quickly deteriorates when many processors are used, while the Sequent Balance retains very high efficiency for a comparable number of participating processors. This leads to the conjecture that message-passing parallel computers are not as well suited for this algorithm as shared-memory machines
The Scalable Coherent Interface and related standards projects
Gustavson, D.B.
1991-09-01
The Scalable Coherent Interface (SCI) project (IEEE P1596) found a way to avoid the limits that are inherent in bus technology. SCI provides bus-like services by transmitting packets on a collection of point-to-point unidirectional links. The SCI protocols support cache coherence in a distributed-shared-memory multiprocessor model, message passing, I/O, and local-area-network-like communication over fiber optic or wire links. VLSI circuits that operate parallel links at 1000 MByte/s and serial links at 1000 Mbit/s will be available early in 1992. Several ongoing SCI-related projects are applying the SCI technology to new areas or extending it to more difficult problems. P1596.1 defines the architecture of a bridge between SCI and VME; P1596.2 compatibly extends the cache coherence mechanism for efficient operation with kiloprocessor systems; P1596.3 defines new low-voltage (about 0.25 V) differential signals suitable for low power interfaces for CMOS or GaAs VLSI implementations of SCI; P1596.4 defines a high performance memory chip interface using these signals; P1596.5 defines data transfer formats for efficient interprocessor communication in heterogeneous multiprocessor systems. This paper reports the current status of SCI, related standards, and new projects. 16 refs
Parallelization of a Monte Carlo particle transport simulation code
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
A Heuristic Method of Optimal Generalized Hypercube Encoding for Pictorial Databases.
1980-01-15
Sqhm*w1*d~WZ ’Nfr ~ K >~ V January15, lfl *1 ~’>L-N; %j4~S ~ C- (*4Th ~4,.,* - <C ~fl2~ ~tS-ri*’.v **’~~ >, 4. 4-. ? 7 V A > - ~ W>4~,,.. * ~S td ...IF I SUM .6?. I I PSUm a PSUM + SUM *SUM ,2. P~to = PSLIM L53 . RETURN £54. so CALL PF4APII.MI.TEMPRI 155. G070 20 156o.N £57. CCCCCCCCCC 158. C 159. C
A note on a hyper-cubic Mahler measure and associated Bessel integral
Glasser, M L
2012-01-01
The Mahler measure for the n-variable polynomial k + ∑(x j + 1/x j ) is reduced to a single integral of the n-th power of the modified Bessel function I 0 . Several special cases are examined in detail. This article is part of ‘Lattice models and integrability’, a special issue of Journal of Physics A: Mathematical and Theoretical in honour of F Y Wu's 80th birthday. (paper)
Plasma turbulence calculations on the Intel iPSC/860 (rx) hypercube
Lynch, V.E.; Ruiter, J.R.
1990-01-01
One approach to improving the real-time efficiency of plasma turbulence calculations is to use a parallel algorithm. A serial algorithm used for plasma turbulence calculations was modified to allocate a radial region in each node. In this way, convolutions at a fixed radius are performed in parallel, and communication is limited to boundary values for each radial region. For a semi-implicity numerical scheme (tridiagonal matrix solver), there is a factor of 3 improvement in efficiency with the Intel iPSC/860 machine using 64 processors over a single-processor Cray-II. For block-tridiagonal matrix cases (fully implicit code), a second parallelization takes place. The Fourier components are distributed in nodes. In each node, the block-tridiagonal matrix is inverted for each of allocated Fourier components. The algorithm for this second case has not yet been optimized. 10 refs., 4 figs
Matrix multiplication with a hypercube algorithm on multi-core processor cluster
José Crispín Zavala-Díaz
2015-01-01
Full Text Available Se analiza, modifica e implementa el algoritmo de multiplicación de matrices de Dekel, Nassimi y Sahani o hipercubo en un cluster de procesadores multi-core, donde el número de procesadores utilizado es menor al requerido por el algoritmo de n3. Se utilizan 23, 43 y 83 unidades procesadoras para multiplicar matrices de orden de magnitud de 10X10, 102X102 y 103X103. Los resultados del modelo matemático del algoritmo modificado y los obtenidos de la experimentación computacional muestran que es posible alcanzar rapidez y eficiencias paralelas aceptables, en función del número de unidades procesadoras utilizadas. También se muestra que la influencia del enlace externo de comunicación entre los nodos disminuye si se utiliza una combinación de los canales de comunicación disponibles entre los núcleos en un cluster multi-core.
Constructing snake-in-the-box codes and families of such codes covering the hypercube
Haryanto, L.
2007-01-01
A snake-in-the-box code (or snake) is a list of binary words of length n such that each word differs from its successor in the list in precisely one bit position. Moreover, any two words in the list differ in at least two positions, unless they are neighbours in the list. The list is considered to
Sets of disjoint snakes based on a Reed-Muller code and covering the hypercube
Van Zanten, A.J.; Haryanto, L.
2008-01-01
A snake-in-the-box code (or snake) of word length n is a simple circuit in an n-dimensional cube Q n , with the additional property that any two non-neighboring words in the circuit differ in at least two positions. To construct such snakes a straightforward, non-recursive method is developed based
Search on a hypercubic lattice using a quantum random walk. II. d=2
Patel, Apoorva; Raghunathan, K. S.; Rahaman, Md. Aminoor
2010-01-01
We investigate the spatial search problem on the two-dimensional square lattice, using the Dirac evolution operator discretized according to the staggered lattice fermion formalism. d=2 is the critical dimension for the spatial search problem, where infrared divergence of the evolution operator leads to logarithmic factors in the scaling behavior. As a result, the construction used in our accompanying article [A. Patel and M. A. Rahaman, Phys. Rev. A 82, 032330 (2010)] provides an O(√(N)lnN) algorithm, which is not optimal. The scaling behavior can be improved to O(√(NlnN)) by cleverly controlling the massless Dirac evolution operator by an ancilla qubit, as proposed by Tulsi [Phys. Rev. A 78, 012310 (2008)]. We reinterpret the ancilla control as introduction of an effective mass at the marked vertex, and optimize the proportionality constants of the scaling behavior of the algorithm by numerically tuning the parameters.
Application of multi-thread computing and domain decomposition to the 3-D neutronics Fem code Cronos
Ragusa, J.C.
2003-01-01
The purpose of this paper is to present the parallelization of the flux solver and the isotopic depletion module of the code, either using Message Passing Interface (MPI) or OpenMP. Thread parallelism using OpenMP was used to parallelize the mixed dual FEM (finite element method) flux solver MINOS. Investigations regarding the opportunity of mixing parallelism paradigms will be discussed. The isotopic depletion module was parallelized using domain decomposition and MPI. An attempt at using OpenMP was unsuccessful and will be explained. This paper is organized as follows: the first section recalls the different types of parallelism. The mixed dual flux solver and its parallelization are then presented. In the third section, we describe the isotopic depletion solver and its parallelization; and finally conclude with some future perspectives. Parallel applications are mandatory for fine mesh 3-dimensional transport and simplified transport multigroup calculations. The MINOS solver of the FEM neutronics code CRONOS2 was parallelized using the directive based standard OpenMP. An efficiency of 80% (resp. 60%) was achieved with 2 (resp. 4) threads. Parallelization of the isotopic depletion solver was obtained using domain decomposition principles and MPI. Efficiencies greater than 90% were reached. These parallel implementations were tested on a shared memory symmetric multiprocessor (SMP) cluster machine. The OpenMP implementation in the solver MINOS is only the first step towards fully using the SMPs cluster potential with a mixed mode parallelism. Mixed mode parallelism can be achieved by combining message passing interface between clusters with OpenMP implicit parallelism within a cluster
Application of multi-thread computing and domain decomposition to the 3-D neutronics Fem code Cronos
Ragusa, J.C. [CEA Saclay, Direction de l' Energie Nucleaire, Service d' Etudes des Reacteurs et de Modelisations Avancees (DEN/SERMA), 91 - Gif sur Yvette (France)
2003-07-01
The purpose of this paper is to present the parallelization of the flux solver and the isotopic depletion module of the code, either using Message Passing Interface (MPI) or OpenMP. Thread parallelism using OpenMP was used to parallelize the mixed dual FEM (finite element method) flux solver MINOS. Investigations regarding the opportunity of mixing parallelism paradigms will be discussed. The isotopic depletion module was parallelized using domain decomposition and MPI. An attempt at using OpenMP was unsuccessful and will be explained. This paper is organized as follows: the first section recalls the different types of parallelism. The mixed dual flux solver and its parallelization are then presented. In the third section, we describe the isotopic depletion solver and its parallelization; and finally conclude with some future perspectives. Parallel applications are mandatory for fine mesh 3-dimensional transport and simplified transport multigroup calculations. The MINOS solver of the FEM neutronics code CRONOS2 was parallelized using the directive based standard OpenMP. An efficiency of 80% (resp. 60%) was achieved with 2 (resp. 4) threads. Parallelization of the isotopic depletion solver was obtained using domain decomposition principles and MPI. Efficiencies greater than 90% were reached. These parallel implementations were tested on a shared memory symmetric multiprocessor (SMP) cluster machine. The OpenMP implementation in the solver MINOS is only the first step towards fully using the SMPs cluster potential with a mixed mode parallelism. Mixed mode parallelism can be achieved by combining message passing interface between clusters with OpenMP implicit parallelism within a cluster.
Portable parallel programming in a Fortran environment
May, E.N.
1989-01-01
Experience using the Argonne-developed PARMACs macro package to implement a portable parallel programming environment is described. Fortran programs with intrinsic parallelism of coarse and medium granularity are easily converted to parallel programs which are portable among a number of commercially available parallel processors in the class of shared-memory bus-based and local-memory network based MIMD processors. The parallelism is implemented using standard UNIX (tm) tools and a small number of easily understood synchronization concepts (monitors and message-passing techniques) to construct and coordinate multiple cooperating processes on one or many processors. Benchmark results are presented for parallel computers such as the Alliant FX/8, the Encore MultiMax, the Sequent Balance, the Intel iPSC/2 Hypercube and a network of Sun 3 workstations. These parallel machines are typical MIMD types with from 8 to 30 processors, each rated at from 1 to 10 MIPS processing power. The demonstration code used for this work is a Monte Carlo simulation of the response to photons of a ''nearly realistic'' lead, iron and plastic electromagnetic and hadronic calorimeter, using the EGS4 code system. 6 refs., 2 figs., 2 tabs
Parallel file system performances in fusion data storage
Iannone, F.; Podda, S.; Bracco, G.; Manduchi, G.; Maslennikov, A.; Migliori, S.; Wolkersdorfer, K.
2012-01-01
High I/O flow rates, up to 10 GB/s, are required in large fusion Tokamak experiments like ITER where hundreds of nodes store simultaneously large amounts of data acquired during the plasma discharges. Typical network topologies such as linear arrays (systolic), rings, meshes (2-D arrays), tori (3-D arrays), trees, butterfly, hypercube in combination with high speed data transports like Infiniband or 10G-Ethernet, are the main areas in which the effort to overcome the so-called parallel I/O bottlenecks is most focused. The high I/O flow rates were modelled in an emulated testbed based on the parallel file systems such as Lustre and GPFS, commonly used in High Performance Computing. The test runs on High Performance Computing–For Fusion (8640 cores) and ENEA CRESCO (3392 cores) supercomputers. Message Passing Interface based applications were developed to emulate parallel I/O on Lustre and GPFS using data archival and access solutions like MDSPLUS and Universal Access Layer. These methods of data storage organization are widely diffused in nuclear fusion experiments and are being developed within the EFDA Integrated Tokamak Modelling – Task Force; the authors tried to evaluate their behaviour in a realistic emulation setup.
Parallel file system performances in fusion data storage
Iannone, F., E-mail: francesco.iannone@enea.it [Associazione EURATOM-ENEA sulla Fusione, C.R.ENEA Frascati, via E.Fermi, 45 - 00044 Frascati, Rome (Italy); Podda, S.; Bracco, G. [ENEA Information Communication Tecnologies, Lungotevere Thaon di Revel, 76 - 00196 Rome (Italy); Manduchi, G. [Associazione EURATOM-ENEA sulla Fusione, Consorzio RFX, Corso Stati Uniti, 4 - 35127 Padua (Italy); Maslennikov, A. [CASPUR Inter-University Consortium for the Application of Super-Computing for Research, via dei Tizii, 6b - 00185 Rome (Italy); Migliori, S. [ENEA Information Communication Tecnologies, Lungotevere Thaon di Revel, 76 - 00196 Rome (Italy); Wolkersdorfer, K. [Juelich Supercomputing Centre-FZJ, D-52425 Juelich (Germany)
2012-12-15
High I/O flow rates, up to 10 GB/s, are required in large fusion Tokamak experiments like ITER where hundreds of nodes store simultaneously large amounts of data acquired during the plasma discharges. Typical network topologies such as linear arrays (systolic), rings, meshes (2-D arrays), tori (3-D arrays), trees, butterfly, hypercube in combination with high speed data transports like Infiniband or 10G-Ethernet, are the main areas in which the effort to overcome the so-called parallel I/O bottlenecks is most focused. The high I/O flow rates were modelled in an emulated testbed based on the parallel file systems such as Lustre and GPFS, commonly used in High Performance Computing. The test runs on High Performance Computing-For Fusion (8640 cores) and ENEA CRESCO (3392 cores) supercomputers. Message Passing Interface based applications were developed to emulate parallel I/O on Lustre and GPFS using data archival and access solutions like MDSPLUS and Universal Access Layer. These methods of data storage organization are widely diffused in nuclear fusion experiments and are being developed within the EFDA Integrated Tokamak Modelling - Task Force; the authors tried to evaluate their behaviour in a realistic emulation setup.
Parallel implementation of many-body mean-field equations
Chinn, C.R.; Umar, A.S.; Vallieres, M.; Strayer, M.R.
1994-01-01
We describe the numerical methods used to solve the system of stiff, nonlinear partial differential equations resulting from the Hartree-Fock description of many-particle quantum systems, as applied to the structure of the nucleus. The solutions are performed on a three-dimensional Cartesian lattice. Discretization is achieved through the lattice basis-spline collocation method, in which quantum-state vectors and coordinate-space operators are expressed in terms of basis-spline functions on a spatial lattice. All numerical procedures reduce to a series of matrix-vector multiplications and other elementary operations, which we perform on a number of different computing architectures, including the Intel Paragon and the Intel iPSC/860 hypercube. Parallelization is achieved through a combination of mechanisms employing the Gram-Schmidt procedure, broadcasts, global operations, and domain decomposition of state vectors. We discuss the approach to the problems of limited node memory and node-to-node communication overhead inherent in using distributed-memory, multiple-instruction, multiple-data stream parallel computers. An algorithm was developed to reduce the communication overhead by pipelining some of the message passing procedures
User's guide to the Reliability Estimation System Testbed (REST)
Nicol, David M.; Palumbo, Daniel L.; Rifkin, Adam
1992-01-01
The Reliability Estimation System Testbed is an X-window based reliability modeling tool that was created to explore the use of the Reliability Modeling Language (RML). RML was defined to support several reliability analysis techniques including modularization, graphical representation, Failure Mode Effects Simulation (FMES), and parallel processing. These techniques are most useful in modeling large systems. Using modularization, an analyst can create reliability models for individual system components. The modules can be tested separately and then combined to compute the total system reliability. Because a one-to-one relationship can be established between system components and the reliability modules, a graphical user interface may be used to describe the system model. RML was designed to permit message passing between modules. This feature enables reliability modeling based on a run time simulation of the system wide effects of a component's failure modes. The use of failure modes effects simulation enhances the analyst's ability to correctly express system behavior when using the modularization approach to reliability modeling. To alleviate the computation bottleneck often found in large reliability models, REST was designed to take advantage of parallel processing on hypercube processors.
Dynamic load balancing in a concurrent plasma PIC code on the JPL/Caltech Mark III hypercube
Liewer, P.C.; Leaver, E.W.; Decyk, V.K.; Dawson, J.M.
1990-01-01
Dynamic load balancing has been implemented in a concurrent one-dimensional electromagnetic plasma particle-in-cell (PIC) simulation code using a method which adds very little overhead to the parallel code. In PIC codes, the orbits of many interacting plasma electrons and ions are followed as an initial value problem as the particles move in electromagnetic fields calculated self-consistently from the particle motions. The code was implemented using the GCPIC algorithm in which the particles are divided among processors by partitioning the spatial domain of the simulation. The problem is load-balanced by partitioning the spatial domain so that each partition has approximately the same number of particles. During the simulation, the partitions are dynamically recreated as the spatial distribution of the particles changes in order to maintain processor load balance
Chonggang Xu; Hong S. He; Yuanman Hu; Yu Chang; Xiuzhen Li; Rencang Bu
2005-01-01
Geostatistical stochastic simulation is always combined with Monte Carlo method to quantify the uncertainty in spatial model simulations. However, due to the relatively long running time of spatially explicit forest models as a result of their complexity, it is always infeasible to generate hundreds or thousands of Monte Carlo simulations. Thus, it is of great...
Design of a nano-satellite demonstrator of an infrared imaging space interferometer: the HyperCube
Dohlen, Kjetil; Vives, Sébastien; Rakotonimbahy, Eddy; Sarkar, Tanmoy; Tasnim Ava, Tanzila; Baccichet, Nicola; Savini, Giorgio; Swinyard, Bruce
2014-07-01
The construction of a kilometer-baseline far infrared imaging interferometer is one of the big instrumental challenges for astronomical instrumentation in the coming decades. Recent proposals such as FIRI, SPIRIT, and PFI illustrate both science cases, from exo-planetary science to study of interstellar media and cosmology, and ideas for construction of such instruments, both in space and on the ground. An interesting option for an imaging multi-aperture interferometer with km baseline is the space-based hyper telescope (HT) where a giant, sparsely populated primary mirror is constituted of several free-flying satellites each carrying a mirror segment. All the segments point the same object and direct their part of the pupil towards a common focus where another satellite, containing recombiner optics and a detector unit, is located. In Labeyrie's [1] original HT concept, perfect phasing of all the segments was assumed, allowing snap-shot imaging within a reduced field of view and coronagraphic extinction of the star. However, for a general purpose observatory, image reconstruction using closure phase a posteriori image reconstruction is possible as long as the pupil is fully non-redundant. Such reconstruction allows for much reduced alignment tolerances, since optical path length control is only required to within several tens of wavelengths, rather than within a fraction of a wavelength. In this paper we present preliminary studies for such an instrument and plans for building a miniature version to be flown on a nano satellite. A design for recombiner optics is proposed, including a scheme for exit pupil re-organization, is proposed, indicating the focal plane satellite in the case of a km-baseline interferometer could be contained within a 1m3 unit. Different options for realization of a miniature version are presented, including instruments for solar observations in the visible and the thermal infrared and giant planet observations in the visible, and an algorithm for design of optimal aperture layout based on least-squares minimization is described. A first experimental setup realized by master students is presented, where a 20mm baseline interferometer with 1mm apertures associated with a thermal infrared camera pointed the sun. The absence of fringes in this setup is discussed in terms of spatial spectrum analysis. Finally, we discuss requirements in terms of satellite pointing requirements for such a miniature interferometer.
Simuta-Champo, R.; Herrera-Zamarrón, G. S.
2010-01-01
The Monte Carlo technique provides a natural method for evaluating uncertainties. The uncertainty is represented by a probability distribution or by related quantities such as statistical moments. When the groundwater flow and transport governing equations are solved and the hydraulic conductivity field is treated as a random spatial function, the hydraulic head, velocities and concentrations also become random spatial functions. When that is the case, for the stochastic simulation of groundw...
Gabor, Nathaniel M.
2017-05-01
Van de Waals (vdW) heterostructures - which consist of precisely assembled atomically thin electronic materials - exhibit unusual quantum behavior. These quantum materials-by-design are of fundamental interest in basic scientific research and hold tremendous potential in advanced technological applications. Problematically, the fundamental optoelectronic response in these heterostructures is difficult to access using the standard techniques within the traditions of materials science and condensed matter physics. In the standard approach, characterization is based on the measurement of a small amount of one-dimensional data, which is used to gain a precise picture of the material properties of the sample. However, these techniques are fundamentally lacking in describing the complex interdependency of experimental degrees of freedom in vdW heterostructures. In this talk, I will present our recent experiments that utilize a highly data-intensive approach to gain deep understanding of the infrared photoresponse in vdW heterostructure photodetectors. These measurements, which combine state-of-the-art data analytics and measurement design with fundamentally new device structures and experimental parameters, give a clear picture of electron-hole pair interactions at ultrafast time scales.
P3T+: A Performance Estimator for Distributed and Parallel Programs
T. Fahringer
2000-01-01
Full Text Available Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particular distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce P3T+ which is a performance estimator for mostly regular HPF (High Performance Fortran programs but partially covers also message passing programs (MPI. P3T+ is unique by modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile-time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of a communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model most critical architecture specific factors such as cache lines sizes, number of cache lines available, startup times, message transfer time per byte, etc. P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC to support programmers develop parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both accuracy and usefulness of P3T+.
Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck
Yan, Jerry C.; Bailey, David (Technical Monitor)
1996-01-01
The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.
Leamy, Michael J.; Springer, Adam C.
In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Experiences in the parallelization of the discrete ordinates method using OpenMP and MPI
Pautz, A. [TUV Hannover/Sachsen-Anhalt e.V. (Germany); Langenbuch, S. [Gesellschaft fur Anlagen- und Reaktorsicherheit (GRS) mbH (Germany)
2003-07-01
The method of Discrete Ordinates is in principle parallelizable to a high degree, since the transport 'mesh sweeps' are mutually independent for all angular directions. However, in the well-known production code Dort such a type of angular domain decomposition has to be done on a spatial line-byline basis, causing the parallelism in the code to be very fine-grained. The construction of scalar fluxes and moments requires a large effort for inter-thread or inter-process communication. We have implemented two different parallelization approaches in Dort: firstly, we have used a shared-memory model suitable for SMP (Symmetric Multiprocessor) machines based on the standard OpenMP. The second approach uses the well-known Message Passing Interface (MPI) to establish communication between parallel processes running in a distributed-memory environment. We investigate the benefits and drawbacks of both models and show first results on performance and scaling behaviour of the parallel Dort code. (authors)
An efficient implementation of parallel molecular dynamics method on SMP cluster architecture
Suzuki, Masaaki; Okuda, Hiroshi; Yagawa, Genki
2003-01-01
The authors have applied MPI/OpenMP hybrid parallel programming model to parallelize a molecular dynamics (MD) method on a symmetric multiprocessor (SMP) cluster architecture. In that architecture, it can be expected that the hybrid parallel programming model, which uses the message passing library such as MPI for inter-SMP node communication and the loop directive such as OpenMP for intra-SNP node parallelization, is the most effective one. In this study, the parallel performance of the hybrid style has been compared with that of conventional flat parallel programming style, which uses only MPI, both in cases the fast multipole method (FMM) is employed for computing long-distance interactions and that is not employed. The computer environments used here are Hitachi SR8000/MPP placed at the University of Tokyo. The results of calculation are as follows. Without FMM, the parallel efficiency using 16 SMP nodes (128 PEs) is: 90% with the hybrid style, 75% with the flat-MPI style for MD simulation with 33,402 atoms. With FMM, the parallel efficiency using 16 SMP nodes (128 PEs) is: 60% with the hybrid style, 48% with the flat-MPI style for MD simulation with 117,649 atoms. (author)
Reilly, M H
2008-01-01
Technology and industry trends have clearly shown that the future of technical computing lies in exploitation of more processors in larger multiprocessor systems. Exploitation of high processor count architectures demands a more thorough understanding of the underlying system dynamics and an accounting for them in the design of high-performance applications. Currently these dynamics are incompletely described by the widely adopted benchmarks and kernel metrics. Systems are most often characterized to allow comparisons and ranking. Often the characterizations are in the form of a scalar measure of some aspect of system performance that is a 'not to exceed' number: the maximum possible level of performance that could be attained. While such comparisons typically drive both system design and procurement, more useful characterizations can be used to drive application development and design. This paper explores a few of these measures and presents a few simple examples of their application. The first set of metrics addresses individual processor performance, specifically performance related to memory references. The second set of metrics attempts to describe the behavior of the message-passing system under load and across a range of conditions
Pattern-Driven Automatic Parallelization
Christoph W. Kessler
1996-01-01
Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.
Moriakov, A.; Vasyukhno, V.; Netecha, M.; Khacheresov, G.
2003-01-01
Powerful supercomputers are available today. MBC-1000M is one of Russian supercomputers that may be used by distant way access. Programs LUCKY and LUCKY C were created to work for multi-processors systems. These programs have algorithms created especially for these computers and used MPI (message passing interface) service for exchanges between processors. LUCKY may resolved shielding tasks by multigroup discreet ordinate method. LUCKY C may resolve critical tasks by same method. Only XYZ orthogonal geometry is available. Under little space steps to approximate discreet operator this geometry may be used as universal one to describe complex geometrical structures. Cross section libraries are used up to P8 approximation by Legendre polynomials for nuclear data in GIT format. Programming language is Fortran-90. 'Vector' processors may be used that lets get a time profit up to 30 times. But unfortunately MBC-1000M has not these processors. Nevertheless sufficient value for efficiency of parallel calculations was obtained under 'space' (LUCKY) and 'space and energy' (LUCKY C ) paralleling. AUTOCAD program is used to control geometry after a treatment of input data. Programs have powerful geometry module, it is a beautiful tool to achieve any geometry. Output results may be processed by graphic programs on personal computer. (authors)
Experiences in the parallelization of the discrete ordinates method using OpenMP and MPI
Pautz, A.; Langenbuch, S.
2003-01-01
The method of Discrete Ordinates is in principle parallelizable to a high degree, since the transport 'mesh sweeps' are mutually independent for all angular directions. However, in the well-known production code Dort such a type of angular domain decomposition has to be done on a spatial line-byline basis, causing the parallelism in the code to be very fine-grained. The construction of scalar fluxes and moments requires a large effort for inter-thread or inter-process communication. We have implemented two different parallelization approaches in Dort: firstly, we have used a shared-memory model suitable for SMP (Symmetric Multiprocessor) machines based on the standard OpenMP. The second approach uses the well-known Message Passing Interface (MPI) to establish communication between parallel processes running in a distributed-memory environment. We investigate the benefits and drawbacks of both models and show first results on performance and scaling behaviour of the parallel Dort code. (authors)
New parallel SOR method by domain partitioning
Xie, Dexuan [Courant Inst. of Mathematical Sciences New York Univ., NY (United States)
1996-12-31
In this paper, we propose and analyze a new parallel SOR method, the PSOR method, formulated by using domain partitioning together with an interprocessor data-communication technique. For the 5-point approximation to the Poisson equation on a square, we show that the ordering of the PSOR based on the strip partition leads to a consistently ordered matrix, and hence the PSOR and the SOR using the row-wise ordering have the same convergence rate. However, in general, the ordering used in PSOR may not be {open_quote}consistently ordered{close_quotes}. So, there is a need to analyze the convergence of PSOR directly. In this paper, we present a PSOR theory, and show that the PSOR method can have the same asymptotic rate of convergence as the corresponding sequential SOR method for a wide class of linear systems in which the matrix is {open_quotes}consistently ordered{close_quotes}. Finally, we demonstrate the parallel performance of the PSOR method on four different message passing multiprocessors (a KSR1, the Intel Delta, an Intel Paragon and an IBM SP2), along with a comparison with the point Red-Black and four-color SOR methods.
Nash, T.; Areti, H.; Atac, R.
1988-08-01
Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs
An efficient parallel algorithm for matrix-vector multiplication
Hendrickson, B.; Leland, R.; Plimpton, S.
1993-03-01
The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/[radical]p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in the well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.
High performance parallel computers for science
Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.
1989-01-01
This paper reports that Fermilab's Advanced Computer Program (ACP) has been developing cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 Mflops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction
Lashkin, S. V.; Kozelkov, A. S.; Yalozo, A. V.; Gerasimov, V. Yu.; Zelensky, D. K.
2017-12-01
This paper describes the details of the parallel implementation of the SIMPLE algorithm for numerical solution of the Navier-Stokes system of equations on arbitrary unstructured grids. The iteration schemes for the serial and parallel versions of the SIMPLE algorithm are implemented. In the description of the parallel implementation, special attention is paid to computational data exchange among processors under the condition of the grid model decomposition using fictitious cells. We discuss the specific features for the storage of distributed matrices and implementation of vector-matrix operations in parallel mode. It is shown that the proposed way of matrix storage reduces the number of interprocessor exchanges. A series of numerical experiments illustrates the effect of the multigrid SLAE solver tuning on the general efficiency of the algorithm; the tuning involves the types of the cycles used (V, W, and F), the number of iterations of a smoothing operator, and the number of cells for coarsening. Two ways (direct and indirect) of efficiency evaluation for parallelization of the numerical algorithm are demonstrated. The paper presents the results of solving some internal and external flow problems with the evaluation of parallelization efficiency by two algorithms. It is shown that the proposed parallel implementation enables efficient computations for the problems on a thousand processors. Based on the results obtained, some general recommendations are made for the optimal tuning of the multigrid solver, as well as for selecting the optimal number of cells per processor.
A multiprocessor system for the analysis of pictures of nuclear events
Bacilieri, P; Matteuzzi, P; Sini, G P; Zanotti, U
1979-01-01
The pictures of nuclear events obtained from the bubble chambers such as Gargamelle and BEBC at CERN and others from Serpukhov are geometrically processed at CNAF (Centro Nazionale Analysis Photogrammi) in Bologna. The analysis system includes an Erasme table and a CRT flying spot digitizer. The difficulties connected with the pictures of the four stereoscopic views of the bubble chambers are overcome by the choice of a strong interactive system. (0 refs).
Feedback control and beam diagnostic algorithms for a multiprocessor DSP system
Teytelman, D.; Claus, R.; Fox, J.; Hindi, H.; Linscott, I.; Prabhakar, S.
1996-09-01
The multibunch longitudinal feedback system developed for use by PEP-II, ALS and DAΦNE uses a parallel array of digital signal processors to calculate the feedback signals from measurements of beam motion. The system is designed with general-purpose programmable elements which allow many feedback operating modes as well as system diagnostics, calibrations and accelerator measurements. The overall signal processing architecture of the system is illustrated. The real-time DSP algorithms and off-line postprocessing tools are presented. The problems in managing 320 K samples of data collected in one beam transient measurement are discussed and the solutions are presented. Example software structures are presented showing the beam feedback process, techniques for modal analysis of beam motion(used to quantify growth and damping rates of instabilities) and diagnostic functions (such as timing adjustment of beam pick-up and kicker components). These operating techniques are illustrated with example results obtained from the system installed at the Advanced Light Source at LBL
Hardware Locks with Priority Ceiling Emulation for a Java Chip-Multiprocessor
Strøm, Torur Biskopstø; Schoeberl, Martin
2015-01-01
According to the safety-critical Java specification, priority ceiling emulation is a requirement for implementations, as it has preferable properties, such as avoiding priority inversion and being deadlock free on uni-core systems. In this paper we explore our hardware supported implementation...... of priority ceiling emulation on the multicore Java optimized processor, and compare it to the existing hardware locks on the Java optimized processor. We find that the additional overhead for priority ceiling emulation on a multicore processor is several times higher than simpler, non-premptive locks, mainly...
Simulation of Particulate Flows on Multi-Processor Machines with Distributed Memory
Uhlmann, M.
2004-01-01
We present a method for the parallelization of an immersed boundary algorithm for particulate flows using the MPI standard of communication. The treatment of the fluid phase uses the domain decomposition technique over a Cartesian processor grid. The solution of the Hehnholtz problem is approximately factorized an relies upon apparel tri-diagonal solver; the Poisson problem is solved by means of a parallel multi-grid technique simulator MUDPACK. For the solid phase we employ a master-slaves technique where one process or handles all the particles contained in its Eulerian fluid sub-domain and zero or more neighbor processors collaborate in the computation of particle-related quantities whenever a particle position overlaps the boundary of a sub- do mam.The parallel efficiency for some preliminary computations is presented. (Author) 9 refs
I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors
Ajwani, Deepak; Sitchinava, Nodar; Zeh, Norbert
2011-01-01
/PB) for a number of problems on axis aligned objects; P denotes the number of cores/processors, B denotes the number of elements that fit in a cache line, N and K denote the sizes of the input and output, respectively, and sortp(N) denotes the I/O complexity of sorting N items using P processors in the PEM model...... framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of O(sortp(N) + K...
Simulation-based Modeling Frameworks for Networked Multi-processor System-on-Chip
Mahadevan, Shankar
2006-01-01
the requirements to model the application and the architecture properties independent of the NoC, and then use these applications to successfully validate the approach against a reference cycle-true system. The presence of a standard socket at the intellectual property (IP) and the NoC interface in both the ARTS...
Virtualization and emulation of a CAN device on a multi-processor system on chip
Breaban, G.D.; Koedam, M.L.P.J.; Stuijk, S.; Goossens, K.G.W.
The increasing number of applications implemented on modern vehicles leads to the use of multi-core platforms in the automotive field. As the number of I/O interfaces offered by these platforms is typically lower than the number of integrated applications, a solution is needed to provide access to
Koskas, G.
1988-01-01
This paper describes the new data acquisition and control system of the neutron scattering instruments at the ORPHEE research reactor. The existing system has undergone a complete change: the original CAMAC system and minicomputer controlling each experiment have given way to commercial CPU boards and microcomputers like the IBM-PC. The communication links between these two components are the IEEE 488 or RS232 standards. Emphasis is placed on the flexibility and modular nature of such a system which makes a maximum use of commercial products thus guaranteeing reliability and ease of use. A study of the requirements and evolutions, technical as well as philosophical, is detailed to demonstrate the motivation of the choice of the system architecture. A survey of the various hardware and software achievements and finally an overview of future improvements is given
IEEE P1596, a scalable coherent interface for GigaByte/sec multiprocessor applications
Gustavson, D.B.
1988-11-01
IEEE P1596, the Scalable Coherent Interface (formerly known as SuperBus) is based on experience gained during the development of Fastbus (IEEE 960), Futurebus (IEEE 896.1) and other modern 32-bit buses. SCI goals include a minimum bandwidth of 1 GByte/sec per processor; efficient support of a coherent distributed-cache image of shared memory; and support for segmentation, bus repeaters and general switched interconnections like Banyan, Omega, or full crossbar networks. To achieve these ambitious goals, SCI must sacrifice the immediate handshake characteristic of the present generation of buses in favor of a packet-like split-cycle protocol. Wire-ORs, broadcasts, and even ordinary passive bus structures are to be avoided. However, a lower performance (1 GByte/sec per backplane instead of per processor) implementation using a register insertion ring architecture on a passive ''backplane'' appears to be possible using the same interface as for the more costly switch networks. This paper presents a summary of current directions, and reports the status of the work in progress
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor
Gilliland, M. C.; Smith, B. J.; Calvert, W.
1976-01-01
The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
Multi-processor system for real-time flow estimation in medical ultrasound imaging
Stetson, Paul F.; Jensen, Jesper Lomborg; Antonius, Peter
1997-01-01
the processed data. The generous bandwidth of the links makes it easy to balance the computational load among the processors.In order to manage the shared system memory and to make use of the parallel processing capabilities of the system, a real-time multitasking kernel has been developed. The kernel uses...
Icarus: A 2-D Direct Simulation Monte Carlo (DSMC) Code for Multi-Processor Computers
BARTEL, TIMOTHY J.; PLIMPTON, STEVEN J.; GALLIS, MICHAIL A.
2001-01-01
Icarus is a 2D Direct Simulation Monte Carlo (DSMC) code which has been optimized for the parallel computing environment. The code is based on the DSMC method of Bird[11.1] and models from free-molecular to continuum flowfields in either cartesian (x, y) or axisymmetric (z, r) coordinates. Computational particles, representing a given number of molecules or atoms, are tracked as they have collisions with other particles or surfaces. Multiple species, internal energy modes (rotation and vibration), chemistry, and ion transport are modeled. A new trace species methodology for collisions and chemistry is used to obtain statistics for small species concentrations. Gas phase chemistry is modeled using steric factors derived from Arrhenius reaction rates or in a manner similar to continuum modeling. Surface chemistry is modeled with surface reaction probabilities; an optional site density, energy dependent, coverage model is included. Electrons are modeled by either a local charge neutrality assumption or as discrete simulational particles. Ion chemistry is modeled with electron impact chemistry rates and charge exchange reactions. Coulomb collision cross-sections are used instead of Variable Hard Sphere values for ion-ion interactions. The electro-static fields can either be: externally input, a Langmuir-Tonks model or from a Green's Function (Boundary Element) based Poison Solver. Icarus has been used for subsonic to hypersonic, chemically reacting, and plasma flows. The Icarus software package includes the grid generation, parallel processor decomposition, post-processing, and restart software. The commercial graphics package, Tecplot, is used for graphics display. All of the software packages are written in standard Fortran
Amitai, Dganit; Averbuch, Amir; Itzikowitz, Samuel; Turkel, Eli
1991-01-01
A major problem in achieving significant speed-up on parallel machines is the overhead involved with synchronizing the concurrent process. Removing the synchronization constraint has the potential of speeding up the computation. The authors present asynchronous (AS) and corrected-asynchronous (CA) finite difference schemes for the multi-dimensional heat equation. Although the discussion concentrates on the Euler scheme for the solution of the heat equation, it has the potential for being extended to other schemes and other parabolic partial differential equations (PDEs). These schemes are analyzed and implemented on the shared memory multi-user Sequent Balance machine. Numerical results for one and two dimensional problems are presented. It is shown experimentally that the synchronization penalty can be about 50 percent of run time: in most cases, the asynchronous scheme runs twice as fast as the parallel synchronous scheme. In general, the efficiency of the parallel schemes increases with processor load, with the time level, and with the problem dimension. The efficiency of the AS may reach 90 percent and over, but it provides accurate results only for steady-state values. The CA, on the other hand, is less efficient, but provides more accurate results for intermediate (non steady-state) values.
Behavior characterization of the shared last-level cache in a chip multiprocessor
Benedicte Illescas, Pedro
2014-01-01
[CATALÀ] Aquest projecte consisteix a analitzar diferents aspectes de la jerarquia de memòria i entendre la seva influència al rendiment del sistema. Els aspectes que s'analitzaran són els algorismes de reemplaçament, els esquemes de mapeig de memòria i les polítiques de pàgina de memòria. [ANGLÈS] This project consists in analyzing different aspects of the memory hierarchy and understanding its influence in the overall system performance. The aspects that will be analyzed are cache replac...
Multicore Processing and ARTEMIS - An incentive to develop the European Multiprocessor research
Seceleanu, Tiberius; Tenhunen, Hannu; Jerraya, Ahmed
2006-01-01
to a traditional SOC view, concurrency at all levels plays a deterministic role, while problems such as power consumption, addressable separately in the nodes of a DS, must be unitary considered. Thus, distinct research and development issues must be defined for MPSOC, building on the indispensable experience...... in DS and SOC, and other related domains. Primarily motivated by market concerns, and also by the promises of the available billion transistor technology, MPSOC is increasingly becoming the preferred target for embedded systems (ES) implementations. Furthermore, the possibility to fit a huge number...
A heterogeneous multiprocessor architecture for low-power audio signal processing applications
Paker, Ozgun; Sparsø, Jens; Haandbæk, Niels
2001-01-01
. The processors are tailored for different classes of filtering algorithms (FIR, IIR, N-LMS etc.), and in a typical system the communication among processors occurs at the sampling rate only. The processors are parameterized in word-size, memory-size, etc. and can be instantiated according to the needs...... of the application at hand using a normal synthesis based ASIC design flow. To give an impression of the size of a processor we mention that one of the FIR processors in a prototype design has 16 instructions, a 32 word×16 bit program memory, a 64 word×16 bit data memory and a 25 word×16 bit coefficient memory....... Early results obtained from the design of a prototype chip containing filter processors for a hearing aid application, indicate a power consumption that is an order of magnitude better than current state of the art low-power audio DSPs implemented using full-custom techniques. This is due to: (1...
A Self Consistent Multiprocessor Space Charge Algorithm that is Almost Embarrassingly Parallel
Nissen, Edward; Erdelyi, B.; Manikonda, S.L.
2012-01-01
We present a space charge code that is self consistent, massively parallelizeable, and requires very little communication between computer nodes; making the calculation almost embarrassingly parallel. This method is implemented in the code COSY Infinity where the differential algebras used in this code are important to the algorithm's proper functioning. The method works by calculating the self consistent space charge distribution using the statistical moments of the test particles, and converting them into polynomial series coefficients. These coefficients are combined with differential algebraic integrals to form the potential, and electric fields. The result is a map which contains the effects of space charge. This method allows for massive parallelization since its statistics based solver doesn't require any binning of particles, and only requires a vector containing the partial sums of the statistical moments for the different nodes to be passed. All other calculations are done independently. The resulting maps can be used to analyze the system using normal form analysis, as well as advance particles in numbers and at speeds that were previously impossible.
Time synchronization for an emulated CAN device on a Multi-Processor System on Chip
Breaban, G.; Koedam, M.; Stuijk, S.; Goossens, K.G.W.
2017-01-01
The increasing number of applications implemented on modern vehicles leads to the use of multi-core platforms in the automotive field. As the number of I/O interfaces offered by these platforms is typically lower than the number of integrated applications, a solution is needed to provide access to
Preemptive scheduling in a two-stage multiprocessor flow shop is NP-hard
Hoogeveen, J.A.; Lenstra, J.K.; Veltman, B.
1996-01-01
In 1954, Johnson gave an efficient algorithm for minimizing makespan in a two-machine flow shop; there is no advantage to preemption in this case. McNaughton's wrap-around rule of 1959 finds a shortest preemptive schedule on identical parallel machines in linear time. A similarly efficient algorithm
Examining the volume efficiency of the cortical architecture in a multi-processor network model.
Ruppin, E; Schwartz, E L; Yeshurun, Y
1993-01-01
The convoluted form of the sheet-like mammalian cortex naturally raises the question whether there is a simple geometrical reason for the prevalence of cortical architecture in the brains of higher vertebrates. Addressing this question, we present a formal analysis of the volume occupied by a massively connected network or processors (neurons) and then consider the pertaining cortical data. Three gross macroscopic features of cortical organization are examined: the segregation of white and gray matter, the circumferential organization of the gray matter around the white matter, and the folded cortical structure. Our results testify to the efficiency of cortical architecture.
3D Reconfigurable NoC Multiprocessor Imaging Interferometer for Space Climate
Dekoulis, George
2016-07-01
This paper describes the development of an imaging interferometer for long-term observations of solar activity related events. Heliospheric physics phenomena are responsible for causing irregularities to the ionospheric-magnetospheric plasmasphere. Distinct signatures of these events are captured and studied over long periods of time deducting crucial conclusions about the short-term Space Weather and in the long run about Space Climate. The new prototype features an eight-channel implementation. The available hardware resources permit a 256- channel configuration for accurate beam scanning of the Earth's plasmasphere. A dual-polarization scheme has been implemented for obtaining accurate measurements. The system is based on state-of-the-art three-dimensional reconfigurable logic and exhibits a performance increase in the range of 70% compared to similar instruments in operation. Special circuits allow measurements of the most intense heliospheric physics events to be fully captured and analyzed.
3D Reconfigurable NoC Multiprocessor Portable Sounder for Plasmaspheric Studies
Dekoulis, George
2016-07-01
The paper describes the development of a prototype imaging sounder for studying the irregularities of the ionospheric plasma. Cutting edge three-dimensional reconfigurable logic has been implemented allowing highly-intensive scientific calculations to be performed in hardware. The new parallel processing algorithms implemented offer a significant amount of performance improvement in the range of 80% compared to existing digital sounder implementations. The current system configuration is taking into consideration the modern scientific needs for portability during scientific campaigns. The prototype acts as a digital signal processing experimentation platform for future larger-scale digital sounder instrumentations for measuring complex planetary plasmaspheric environments.
Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?
1994-01-01
400 MB/second. 4 Dubnicki’s work used trace-driven simulation, with traces collected on an 8-processor machine. We would expect such small-scale...312 1 6 32 64 of odk Sb* Bad64.M Figure 17: Miss rate of Ind Blocked LU. Figure 18: MCPR of Ind Blocked LU. overall miss rate of TGauss is a factor of...easily. 17 (’his approach assunics that the model paramelers we collect from simulations with infinite band- width (such as the miss rate and the
Ahmad, W.; Holzenspies, P.K.F.; Stoelinga, Mariëlle Ida Antoinette; van de Pol, Jan Cornelis
2015-01-01
Execution time is no longer the only performance metric for computer systems. In fact, a trend is emerging to trade raw performance for energy savings. Techniques like Dynamic Power Management (DPM, switching to low power state) and Dynamic Voltage and Frequency Scaling (DVFS, throttling processor
Green computing: power optimisation of vfi-based real-time multiprocessor dataflow applications
Ahmad, W.; Holzenspies, P.K.F.; Stoelinga, Mariëlle Ida Antoinette; van de Pol, Jan Cornelis
2015-01-01
Execution time is no longer the only performance metric for computer systems. In fact, a trend is emerging to trade raw performance for energy savings. Techniques like Dynamic Power Management (DPM, switching to low power state) and Dynamic Voltage and Frequency Scaling (DVFS, throttling processor
Simulation of Particulate Flows Multi-Processor Machines with Distributed Memory
Uhlmann, M.
2004-07-01
We presented a method for the parallelization of an immersed boundary algorithm for particulate flows using the MPI standard of communication. The treatment of the fluid phase used the domain decomposition technique over a Cartesian processor grid. The solution of the Helmholtz problem is approximately factorized an relies upon apparel tri-diagonal solver the Poisson problem is solved by means of a parallel multi-grid technique similar to MUDPACK. for the solid phase we employ a master-slaves technique where one processor handles all the particles contained in its Eulerian fluid sub-domain and zero or more neighbor processors collaborate in the computation of particle-related quantities whenever a particle position over laps the boundary of a sub-domain. the parallel efficiency for some preliminary computations is presented. (Author) 9 refs.
Welch, L.C.; Moog, T.H.; Daly, R.T.; Videbaek, F.
1987-05-01
The ever increasing complexity of nuclear physics experiments places severe demands on computerized data acquisition systems. A natural evolution of these systems, taking advantages of the independent nature of ''events,'' is to use identical parallel microcomputers in a front end to simultaneously analyze separate events. Such a system has been developed at Argonne to serve the needs of the experimental program of ATLAS, a new superconducting heavy-ion accelerator and other on-going research. Using microcomputers based on the National Semiconductor 32016 microprocessor housed in a Multibus I cage, CPU power equivalent to several VAXs is obtained at a fraction of the cost of one VAX. The front end interfacs to a VAX 11/750 on which an extensive user friendly command language based on DCL resides. The whole system, known as DAPHNE, also provides the means to reply data using the same command language. Design concepts, data structures, performance, and experience to data are discussed
Ahmad, W.
2017-01-01
Streaming applications such as virtual reality, video conferencing, and face detection, impose high demands on a system’s performance and battery life. With the advancement in mobile computing, these applications are increasingly implemented on battery-constrained platforms, such as gaming consoles,
Welch, L.C.; Moog, T.H.; Daly, R.T.; Videbaek, F.
1986-01-01
The ever increasing complexity of nuclear physics experiments places severe demands on computerized data acquisition systems. A natural evolution of these system, taking advantage of the independent nature of ''events'', is to use identical parallel microcomputers in a front end to simultaneously analyze separate events. Such a system has been developed at Argonne to serve the needs of the experimental program of ATLAS, a new superconducting heavy-ion accelerator and other on-going research. Using microcomputers based on the National Semiconductor 32016 microprocessor housed in a Multibus I cage, multi-VAX cpu power is obtained at a fraction of the cost of one VAX. The front end interfaces to a VAX 750 on which an extensive user friendly command language based on DCL resides. The whole system, known as DAPHNE, also provides the means to replay data using the same command language. Design concepts, data structures, performance, and experience to data are discussed. 5 refs., 2 figs
Centralized multiprocessor control system for the frascati storage rings DAΦNE
Di Pirro, G.; Milardi, C.; Serio, M.
1992-01-01
We describe the status of the DANTE (DAΦne New Tools Environment) control system for the new DAΦNE Φ-factory under construction at the Frascati National Laboratories. The system is based on a centralized communication architecture for simplicity and reliability. A central processor unit coordinates all communications between the consoles and the lower level distributed processing power, and continuously updates a central memory that contains the whole machine status. We have developed a system of VME Fiber Optic interfaces allowing very fast point to point communication between distant processors. Macintosh II personal computers are used as consoles. The lower levels are all built using the VME standard. (author)
A universal multiprocessor system for the fast acquisition and processing of positron camera data
Deluigi, B.
1982-01-01
In this study the main components of a suitable detection system were worked out, and their properties were examined. For the measurement of the three-dimensional distribution of radiopharmaka marked by positron emitters in animal-experimental studies first a positron camera was constructed. For the detection of the annihilation quanta serve two opposite lying position-sensitive gamma detectors which are derived in coincidence. Two commercial camera heads working according to the Anger principle were reconstructed for these purposes and switched together by a special interface to the positron camera. By this arrangement a spatial resolution of 0.8 cm FWHM for a line source in the symmetry plane and a coincidence resolution time 2T of 16ns FW0.1M was reached. For the three-dimensional image reconstruction with the data of a positron camera a maximum-likelihood procedure was developed and tested by a Monte Carlo procedure. In view of this application an at most flexible multi-microprocessor system was developed. A high computing capacity is reached owing to the fact that several partial problems are distributed to different processors and are processed parallely. The architecture was so scheduled that the system possesses a high error tolerance and that the computing capacity can be extended without a principal limit. (orig./HSI) [de
The Triangle: a Multiprocessor Architecture for Fast Curve and Surface Generation.
1987-08-01
In this type of design, the designer may begin with a mental image of the desired shape, the goal being communication of that shape to the system. The...system in turn stores some internal representation of the shape. For "free-form" shapes such as the outline of a character in a typography system...by stereo-graphic viewing is that points of evaluation must be passed through two perspective transformations - one for the left eye image and one for
van de Burgwal, M.D.; Rovers, K.C.; Blom, K.C.H.; Kokkeler, Andre B.J.; Smit, Gerardus Johannes Maria
2011-01-01
Traditionally, mechanically steered dishes or analog phased array beamforming systems have been used for radio frequency receivers, where strong directivity and high performance were much more important than low-cost requirements. Real-time controlled digital phased array beamforming could not be
Parallel diffusion calculation for the PHAETON on-line multiprocessor computer
Collart, J.M.; Fedon-Magnaud, C.; Lautard, J.J.
1987-04-01
The aim of the PHAETON project is the design of an on-line computer in order to increase the immediate knowledge of the main operating and safety parameters in power plants. A significant stage is the computation of the three dimensional flux distribution. For cost and safety reason a computer based on a parallel microprocessor architecture has been studied. This paper presents a first approach to parallelized three dimensional diffusion calculation. A computing software has been written and built in a four processors demonstrator. We present the realization in progress, concerning the final equipment. 8 refs
Cooperative Localization for Mobile Networks
Cakmak, Burak; Urup, Daniel Nygaard; Meyer, Florian
2016-01-01
We propose a hybrid message passing method for distributed cooperative localization and tracking of mobile agents. Belief propagation and mean field message passing are employed for, respectively, the motion-related and measurementrelated part of the factor graph. Using a Gaussian belief approxim......We propose a hybrid message passing method for distributed cooperative localization and tracking of mobile agents. Belief propagation and mean field message passing are employed for, respectively, the motion-related and measurementrelated part of the factor graph. Using a Gaussian belief...
Performance comparison analysis library communication cluster system using merge sort
Wulandari, D. A. R.; Ramadhan, M. E.
2018-04-01
Begins by using a single processor, to increase the speed of computing time, the use of multi-processor was introduced. The second paradigm is known as parallel computing, example cluster. The cluster must have the communication potocol for processing, one of it is message passing Interface (MPI). MPI have many library, both of them OPENMPI and MPICH2. Performance of the cluster machine depend on suitable between performance characters of library communication and characters of the problem so this study aims to analyze the comparative performances libraries in handling parallel computing process. The case study in this research are MPICH2 and OpenMPI. This case research execute sorting’s problem to know the performance of cluster system. The sorting problem use mergesort method. The research method is by implementing OpenMPI and MPICH2 on a Linux-based cluster by using five computer virtual then analyze the performance of the system by different scenario tests and three parameters for to know the performance of MPICH2 and OpenMPI. These performances are execution time, speedup and efficiency. The results of this study showed that the addition of each data size makes OpenMPI and MPICH2 have an average speed-up and efficiency tend to increase but at a large data size decreases. increased data size doesn’t necessarily increased speed up and efficiency but only execution time example in 100000 data size. OpenMPI has a execution time greater than MPICH2 example in 1000 data size average execution time with MPICH2 is 0,009721 and OpenMPI is 0,003895 OpenMPI can customize communication needs.
2017-06-01
Example 32-bit Chromosome Genetic Representation of a Semiconductor Application . Source: Bates (2004). The algorithm chooses a random instance of the...generated by a certain input parameter combination may be found. That was the achievement of the genetic algorithm application as well, but neither...DISTRIBUTION CODE 13. ABSTRACT (maximum 200 words) This thesis focuses on the replacement of a genetic algorithm currently used to optimize
Entrywise Squared Transforms for GAMP Supplementary Material
2016-01-01
Supplementary material for a study on Entrywise Squared Transforms for Generalized Approximate Message Passing (GAMP). See the README file for the details.......Supplementary material for a study on Entrywise Squared Transforms for Generalized Approximate Message Passing (GAMP). See the README file for the details....
A lightweight communication library for distributed computing
Groen, D.; Rieder, S.; Grosso, P.; de Laat, C.; Portegies Zwart, S.
2010-01-01
We present MPWide, a platform-independent communication library for performing message passing between computers. Our library allows coupling of several local message passing interface (MPI) applications through a long-distance network and is specifically optimized for such communications. The
Welch, L.C.
1984-01-01
This paper describes a project to meet these data acquisition needs for a new accelerator, ATLAS, being built at Argonne National Laboratory. ATLAS is a heavy-ion linear superconducting accelerator providing beam energies up to 25 MeV/A with a relative spread in beam energy as good as .0001 and a time spread of less than 100 psec. Details about the hardware front end, command language, data structure, and the flow of event treatment are covered.
Koskas, G.
1987-04-01
This paper describes the new data acquisition and control systems of the neutron scattering instruments at the ORPHEE research reactor. The existing system has undergone a complete change: the original CAMAC system and minicomputer controlling each experiment have given way to commercial CPU boards and microcomputers like the IBM PC. The communication links between these 2 components are the IEEE 488 or RS 232 standards. Emphasis is placed on flexibility and modular nature of such a system which makes a maximum use of commercial products thus guaranteeing reliability and ease of use. A study of the requirements and evolutions, technical as well as philosophical, is detailed in order to demonstrate the motivation of the choice of the system architecture. A survey of the various hardware and software achievements and finally an overview on the future improvements is given [fr
Jameson, A.; Leicher, S.; Dawson, J.; Tel Aviv Univ., Israel)
1985-01-01
A multiblock modification of the FLO57 code for three-dimensional wing calculations is described and demonstrated. The theoretical basis of the multistage time-stepping algorithm is reviewed; the multiblock grid structure is explained; and results from a computation of vortical flow past a delta wing, using 2.5 x 10 to the 6th grid points and performed on a Cray X/MP computer with a 128-Mword solid-state storage device, are presented graphically. 6 references
Lallement, Dominique.
1979-01-01
Nuclear reactors are monitored by several systems combined. The hydraulic and mechanical limitations on the equipment and the heat transfer requirements in the core set a reliable working range for the boiler defined with certain safety margins. The control system tends to keep the power plant within this working range. The protection system covers all the electrical and mechanical equipment needed to safeguard the boiler in the event of abnormal transients or accidents accounted for in the design of the plant. On units in service protection is handled by cabled automatic systems. For better reliability and safety operation, greater flexibility of use (modularity, adaptability) and improved start-up criteria by data processing the tendency is to use digital programmed systems. Computers are already present in control systems but their introduction into protection systems meets with some reticence on the part of the nuclear safety authorities. A study on the replacement of conventional by digital protection systems is presented. From choices partly made on the principles which should govern the hardware and software of a protection system the reliability of different structures and elements was examined and an experimental model built with its simulator and test facilities. A prototype based on these options and studies is being built and is to be set up on one of the CEN-G reactors for tests [fr
A RISC multiprocessor event trigger for the data acquisition system of the H1 experiment at HERA
Campbell, A.J.
1991-09-01
In late 1991 HERA will for the first time collide stored beams of electrons and protons. This paper describes the multiple (RISC) modern reduced instruction set processor system for online event filtering and reconstruction installed within the data acquisition system of the H1 experiment. Data is processed at a continuous average rate of ∼ 6 Mbytes/s in parallel by ∼ 20 R3000 VMEbus based monoboard computers providing some 400 mips computing power. (author)
Hance, R.; Areti, H.; Atac, R.
1987-01-01
The ACP Branchbus, a high speed differential bus for data movement in multiprocessing and data acquisition environments, is described. This bus was designed as the central bus in the ACP multiprocessing system. In its full implementation with 16 branches and a bus switch, it will handle data rates of 160 MByte/sec and allow reliable data transmission over inter rack distances. We also summarize applications of the ACP system in experimental data acquisition, triggering and monitoring, with special attention paid to FASTBUS environments
Moy , Christophe; Raulet , Mickaël
2010-01-01
International audience; The design of Software Defined Radio (SDR) equipments (terminals, base stations, etc.) is still very challenging. We propose here a design methodology for ultra-fast prototyping on heterogeneous platforms made of GPPs (General Purpose Processors), DSPs (Digital Signal Processors) and FPGAs (Field Programmable Gate Array). Lying on a component-based approach, the methodology mainly aims at automating as much as possible the design from an algorithmic validation to a mul...
Jan, Y.; Jóźwiak, Lech
Numerous modern applications in various fields, such as communication and networking, multimedia, encryption, etc., impose extremely high demands regarding performance while at the same time requiring low energy consumption, low cost, and short design time. Often these very high demands cannot be
Nelson, A.T.; Goossens, K.G.W.; Akesson, K.B.
2015-01-01
Embedded systems often contain multiple applications, some of which have real-time requirements and whose performance must be guaranteed. To efficiently execute applications, modern embedded systems contain Globally Asynchronous Locally Synchronous (GALS) processors, network on chip, DRAM and SRAM
A RISC multiprocessor event trigger for the data acquisition system of the H1 experiment at HERA
Campbell, A.J.
1992-01-01
In late 1991 HERA will collide for the first time stored beams of electrons and protons. This paper describes the multiple RISC processor system for online event filtering and reconstruction installed within the data acquisition system of the H1 experiment. Data is processed at a continuous average rate of ∼6 Mbytes/s in parallel by ∼20 R3000 VMEbus based monoboard computers providing some 400 MIPS computing power
A parallel-pipelined architecture for a multi carrier demodulator
Kwatra, S. C.; Jamali, M. M.; Eugene, Linus P.
1991-03-01
Analog devices have been used for processing the information on board the satellites. Presently, digital devices are being used because they are economical and flexible as compared to their analog counterparts. Several schemes of digital transmission can be used depending on the data rate requirement of the user. An economical scheme of transmission for small earth stations uses single channel per carrier/frequency division multiple access (SCPC/FDMA) on the uplink and time division multiplexing (TDM) on the downlink. This is a typical communication service offered to low data rate users in commercial mass market. These channels usually pertain to either voice or data transmission. An efficient digital demodulator architecture is provided for a large number of law data rate users. A demodulator primarily consists of carrier, clock, and data recovery modules. This design uses principles of parallel processing, pipelining, and time sharing schemes to process large numbers of voice or data channels. It maintains the optimum throughput which is derived from the designed architecture and from the use of high speed components. The design is optimized for reduced power and area requirements. This is essential for satellite applications. The design is also flexible in processing a group of a varying number of channels. The algorithms that are used are verified by the use of a computer aided software engineering (CASE) tool called the Block Oriented System Simulator. The data flow, control circuitry, and interface of the hardware design is simulated in C language. Also, a multiprocessor approach is provided to map, model, and simulate the demodulation algorithms mainly from a speed view point. A hypercude based architecture implementation is provided for such a scheme of operation. The hypercube structure and the demodulation models on hypercubes are simulated in Ada.
High performance electromagnetic simulation tools
Gedney, Stephen D.; Whites, Keith W.
1994-10-01
Army Research Office Grant #DAAH04-93-G-0453 has supported the purchase of 24 additional compute nodes that were installed in the Intel iPsC/860 hypercube at the Univesity Of Kentucky (UK), rendering a 32-node multiprocessor. This facility has allowed the investigators to explore and extend the boundaries of electromagnetic simulation for important areas of defense concerns including microwave monolithic integrated circuit (MMIC) design/analysis and electromagnetic materials research and development. The iPSC/860 has also provided an ideal platform for MMIC circuit simulations. A number of parallel methods based on direct time-domain solutions of Maxwell's equations have been developed on the iPSC/860, including a parallel finite-difference time-domain (FDTD) algorithm, and a parallel planar generalized Yee-algorithm (PGY). The iPSC/860 has also provided an ideal platform on which to develop a 'virtual laboratory' to numerically analyze, scientifically study and develop new types of materials with beneficial electromagnetic properties. These materials simulations are capable of assembling hundreds of microscopic inclusions from which an electromagnetic full-wave solution will be obtained in toto. This powerful simulation tool has enabled research of the full-wave analysis of complex multicomponent MMIC devices and the electromagnetic properties of many types of materials to be performed numerically rather than strictly in the laboratory.
High-speed computation of the EM algorithm for PET image reconstruction
Rajan, K.; Patnaik, L.M.; Ramakrishna, J.
1994-01-01
The PET image reconstruction based on the EM algorithm has several attractive advantages over the conventional convolution backprojection algorithms. However, two major drawbacks have impeded the routine use of the EM algorithm, namely, the long computational time due to slow convergence and the large memory required for the storage of the image, projection data and the probability matrix. In this study, the authors attempts to solve these two problems by parallelizing the EM algorithm on a multiprocessor system. The authors have implemented an extended hypercube (EH) architecture for the high-speed computation of the EM algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs). The authors discuss and compare the performance of the EM algorithm on a 386/387 machine, CD 4360 mainframe, and on the EH system. The results show that the computational speed performance of an EH using DSP chips as PEs executing the EM image reconstruction algorithm is about 130 times better than that of the CD 4360 mainframe. The EH topology is expandable with more number of PEs
Merging Belief Propagation and the Mean Field Approximation: A Free Energy Approach
Riegler, Erwin; Kirkelund, Gunvor Elisabeth; Manchón, Carles Navarro
2013-01-01
We present a joint message passing approach that combines belief propagation and the mean field approximation. Our analysis is based on the region-based free energy approximation method proposed by Yedidia et al. We show that the message passing fixed-point equations obtained with this combination...... correspond to stationary points of a constrained region-based free energy approximation. Moreover, we present a convergent implementation of these message passing fixed-point equations provided that the underlying factor graph fulfills certain technical conditions. In addition, we show how to include hard...
Revolutionary Advances in Ubiquitious, Real-Time Multicomputers and Runtime Environments
Agrawala, Ashok
2001-01-01
This work was a grant to enhance the Maruti operating system in several ways, in order to provide Mississippi State with a platform upon which their work on the Real-Time Message Passing Interface could be developed...
High-Level Management of Communication Schedules in HPF-like Languages
Benkner, Siegfried
1997-01-01
..., providing the users with a high-level language interface for programming scalable parallel architectures and delegating to the compiler the task of producing an explicitly parallel message-passing program...
Parallel Monte Carlo simulation of aerosol dynamics
Zhou, K.; He, Z.; Xiao, M.; Zhang, Z.
2014-01-01
is simulated with a stochastic method (Marcus-Lushnikov stochastic process). Operator splitting techniques are used to synthesize the deterministic and stochastic parts in the algorithm. The algorithm is parallelized using the Message Passing Interface (MPI
Hisley, Dixie
1999-01-01
.... The goals of this report are: (1) to investigate the performance of message passing and loop level parallelization techniques, as they were implemented in the computational fluid dynamics (CFD...
Receiver Architectures for MIMO-OFDM Based on a Combined VMP-SP Algorithm
Manchón, Carles Navarro; Kirkelund, Gunvor Elisabeth; Riegler, Erwin
2011-01-01
, such as the sum-product (SP) and variational message passing (VMP) algorithms, have become increasingly popular. In this contribution, we apply a combined VMP-SP message-passing technique to the design of receivers for MIMO-ODFM systems. The message-passing equations of the combined scheme can be obtained from......Iterative information processing, either based on heuristics or analytical frameworks, has been shown to be a very powerful tool for the design of efficient, yet feasible, wireless receiver architectures. Within this context, algorithms performing message-passing on a probabilistic graph...... assessment of our solutions, based on Monte Carlo simulations, corroborates the high performance of the proposed algorithms and their superiority to heuristic approaches....
Building Blocks for the Rapid Development of Parallel Simulations, Phase I
National Aeronautics and Space Administration — Scientists need to be able to quickly develop and run parallel simulations without paying the high price of writing low-level message passing codes using compiled...
A platform independent communication library for distributed computing
Groen, D.; Rieder, S.; Grosso, P.; de Laat, C.; Portegies Zwart, S.
2010-01-01
We present MPWide, a platform independent communication library for performing message passing between supercomputers. Our library couples several local MPI applications through a long distance network using, for example, optical links. The implementation is deliberately kept light-weight, platform
A portable implementation of ARPACK for distributed memory parallel architectures
Maschhoff, K.J.; Sorensen, D.C.
1996-12-31
ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and Message Passing Interface(MPI).
TESLA: Large Signal Simulation Code for Klystrons
Vlasov, Alexander N.; Cooke, Simon J.; Chernin, David P.; Antonsen, Thomas M. Jr.; Nguyen, Khanh T.; Levush, Baruch
2003-01-01
TESLA (Telegraphist's Equations Solution for Linear Beam Amplifiers) is a new code designed to simulate linear beam vacuum electronic devices with cavities, such as klystrons, extended interaction klystrons, twistrons, and coupled cavity amplifiers. The model includes a self-consistent, nonlinear solution of the three-dimensional electron equations of motion and the solution of time-dependent field equations. The model differs from the conventional Particle in Cell approach in that the field spectrum is assumed to consist of a carrier frequency and its harmonics with slowly varying envelopes. Also, fields in the external cavities are modeled with circuit like equations and couple to fields in the beam region through boundary conditions on the beam tunnel wall. The model in TESLA is an extension of the model used in gyrotron code MAGY. The TESLA formulation has been extended to be capable to treat the multiple beam case, in which each beam is transported inside its own tunnel. The beams interact with each other as they pass through the gaps in their common cavities. The interaction is treated by modification of the boundary conditions on the wall of each tunnel to include the effect of adjacent beams as well as the fields excited in each cavity. The extended version of TESLA for the multiple beam case, TESLA-MB, has been developed for single processor machines, and can run on UNIX machines and on PC computers with a large memory (above 2GB). The TESLA-MB algorithm is currently being modified to simulate multiple beam klystrons on multiprocessor machines using the MPI (Message Passing Interface) environment. The code TESLA has been verified by comparison with MAGIC for single and multiple beam cases. The TESLA code and the MAGIC code predict the same power within 1% for a simple two cavity klystron design while the computational time for TESLA is orders of magnitude less than for MAGIC 2D. In addition, recently TESLA was used to model the L-6048 klystron, code
Efficient parallel implicit methods for rotary-wing aerodynamics calculations
Wissink, Andrew M.
Euler/Navier-Stokes Computational Fluid Dynamics (CFD) methods are commonly used for prediction of the aerodynamics and aeroacoustics of modern rotary-wing aircraft. However, their widespread application to large complex problems is limited lack of adequate computing power. Parallel processing offers the potential for dramatic increases in computing power, but most conventional implicit solution methods are inefficient in parallel and new techniques must be adopted to realize its potential. This work proposes alternative implicit schemes for Euler/Navier-Stokes rotary-wing calculations which are robust and efficient in parallel. The first part of this work proposes an efficient parallelizable modification of the Lower Upper-Symmetric Gauss Seidel (LU-SGS) implicit operator used in the well-known Transonic Unsteady Rotor Navier Stokes (TURNS) code. The new hybrid LU-SGS scheme couples a point-relaxation approach of the Data Parallel-Lower Upper Relaxation (DP-LUR) algorithm for inter-processor communication with the Symmetric Gauss Seidel algorithm of LU-SGS for on-processor computations. With the modified operator, TURNS is implemented in parallel using Message Passing Interface (MPI) for communication. Numerical performance and parallel efficiency are evaluated on the IBM SP2 and Thinking Machines CM-5 multi-processors for a variety of steady-state and unsteady test cases. The hybrid LU-SGS scheme maintains the numerical performance of the original LU-SGS algorithm in all cases and shows a good degree of parallel efficiency. It experiences a higher degree of robustness than DP-LUR for third-order upwind solutions. The second part of this work examines use of Krylov subspace iterative solvers for the nonlinear CFD solutions. The hybrid LU-SGS scheme is used as a parallelizable preconditioner. Two iterative methods are tested, Generalized Minimum Residual (GMRES) and Orthogonal s-Step Generalized Conjugate Residual (OSGCR). The Newton method demonstrates good
Computationally efficient implementation of combustion chemistry in parallel PDF calculations
Lu Liuyan; Lantz, Steven R.; Ren Zhuyin; Pope, Stephen B.
2009-01-01
In parallel calculations of combustion processes with realistic chemistry, the serial in situ adaptive tabulation (ISAT) algorithm [S.B. Pope, Computationally efficient implementation of combustion chemistry using in situ adaptive tabulation, Combustion Theory and Modelling, 1 (1997) 41-63; L. Lu, S.B. Pope, An improved algorithm for in situ adaptive tabulation, Journal of Computational Physics 228 (2009) 361-386] substantially speeds up the chemistry calculations on each processor. To improve the parallel efficiency of large ensembles of such calculations in parallel computations, in this work, the ISAT algorithm is extended to the multi-processor environment, with the aim of minimizing the wall clock time required for the whole ensemble. Parallel ISAT strategies are developed by combining the existing serial ISAT algorithm with different distribution strategies, namely purely local processing (PLP), uniformly random distribution (URAN), and preferential distribution (PREF). The distribution strategies enable the queued load redistribution of chemistry calculations among processors using message passing. They are implemented in the software x2f m pi, which is a Fortran 95 library for facilitating many parallel evaluations of a general vector function. The relative performance of the parallel ISAT strategies is investigated in different computational regimes via the PDF calculations of multiple partially stirred reactors burning methane/air mixtures. The results show that the performance of ISAT with a fixed distribution strategy strongly depends on certain computational regimes, based on how much memory is available and how much overlap exists between tabulated information on different processors. No one fixed strategy consistently achieves good performance in all the regimes. Therefore, an adaptive distribution strategy, which blends PLP, URAN and PREF, is devised and implemented. It yields consistently good performance in all regimes. In the adaptive parallel
S. Raia
2014-03-01
of several model runs obtained varying the input parameters are analyzed statistically, and compared to the original (deterministic model output. The comparison suggests an improvement of the predictive power of the model of about 10% and 16% in two small test areas, that is, the Frontignano (Italy and the Mukilteo (USA areas. We discuss the computational requirements of TRIGRS-P to determine the potential use of the numerical model to forecast the spatial and temporal occurrence of rainfall-induced shallow landslides in very large areas, extending for several hundreds or thousands of square kilometers. Parallel execution of the code using a simple process distribution and the message passing interface (MPI on multi-processor machines was successful, opening the possibly of testing the use of TRIGRS-P for the operational forecasting of rainfall-induced shallow landslides over large regions.
Raia, S.; Alvioli, M.; Rossi, M.; Baum, R.L.; Godt, J.W.; Guzzetti, F.
2013-01-01
are analyzed statistically, and compared to the original (deterministic) model output. The comparison suggests an improvement of the predictive power of the model of about 10% and 16% in two small test areas, i.e. the Frontignano (Italy) and the Mukilteo (USA) areas, respectively. We discuss the computational requirements of TRIGRS-P to determine the potential use of the numerical model to forecast the spatial and temporal occurrence of rainfall-induced shallow landslides in very large areas, extending for several hundreds or thousands of square kilometers. Parallel execution of the code using a simple process distribution and the Message Passing Interface (MPI) on multi-processor machines was successful, opening the possibly of testing the use of TRIGRS-P for the operational forecasting of rainfall-induced shallow landslides over large regions.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the
Nieuwland, A.K.; Kang, J.; Gangwal, O.P.; Sethuraman, R.; Busá, N.G.; Goossens, K.G.W.; Peset Llopis, R.; Lippens, P.E.R.
2002-01-01
The key issue in the design of Systems-on-a-Chip (SoC) is to trade-off efficiency against flexibility, and time to market versus cost. Current deep submicron processing technologies enable integration of multiple software programmable processors (e.g., CPUs, DSPs) and dedicated hardware components
Poggioli, Jean Renaud
1984-01-01
This research thesis reports the development of a software for an acquisition computer aimed at the on-line control of nuclear physics experiments. An original architecture, based on the assignment of a processor to each fundamental task, enables the implementation of a high performance system. In order to make the user free of programming constraints, the author developed a software for dynamic generation of acquisition and processing codes. These codes are created from a data base which is programmed by the user by using a language close to the physical reality. Procedures of interactive control of the experiment are thus simplified by displaying function menus on the operator terminal. The author evokes possible hardware improvements and possible extensions of the system [fr
Mirin, A.A.
1988-01-01
A formula is derived for predicting multiprocessing efficiency on Cray supercomputers equipped with the Cray Time-Sharing System (CTSS). The model is applicable to an intensive time-sharing environment. The actual efficiency estimate depends on three factors: the code size, task length, and job mix. The implementation of multitasking in a three-dimensional plasma magnetohydrodynamics (MHD) code, TEMCO, is discussed. TEMCO solves the primitive one-fluid compressible MHD equations and includes resistive and Hall effects in Ohm's law. Virtually all segments of the main time-integration loop are multitasked. The multiprocessing efficiency model is applied to TEMCO. Excellent agreement is obtained between the actual multiprocessing efficiency and the theoretical prediction
A parallel solver for huge dense linear systems
Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.
2011-11-01
HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system
Domínguez, Eduardo; Lage-Castellanos, Alejandro; Mulet, Roberto; Ricci-Tersenghi, Federico; Rizzo, Tommaso
2011-01-01
We study the performance of different message passing algorithms in the two-dimensional Edwards–Anderson model. We show that the standard belief propagation (BP) algorithm converges only at high temperature to a paramagnetic solution. Then, we test a generalized belief propagation (GBP) algorithm, derived from a cluster variational method (CVM) at the plaquette level. We compare its performance with BP and with other algorithms derived under the same approximation: double loop (DL) and a two-way message passing algorithm (HAK). The plaquette-CVM approximation improves BP in at least three ways: the quality of the paramagnetic solution at high temperatures, a better estimate (lower) for the critical temperature, and the fact that the GBP message passing algorithm converges also to nonparamagnetic solutions. The lack of convergence of the standard GBP message passing algorithm at low temperatures seems to be related to the implementation details and not to the appearance of long range order. In fact, we prove that a gauge invariance of the constrained CVM free energy can be exploited to derive a new message passing algorithm which converges at even lower temperatures. In all its region of convergence this new algorithm is faster than HAK and DL by some orders of magnitude
MPIRUN: A Portable Loader for Multidisciplinary and Multi-Zonal Applications
Fineberg, Samuel A.; Woodrow, Thomas S. (Technical Monitor)
1994-01-01
Multidisciplinary and multi-zonal applications are an important class of applications in the area of Computational Aerosciences. In these codes, two or more distinct parallel programs or copies of a single program are utilized to model a single problem. To support such applications, it is common to use a programming model where a program is divided into several single program multiple data stream (SPMD) applications, each of which solves the equations for a single physical discipline or grid zone. These SPMD applications are then bound together to form a single multidisciplinary or multi-zonal program in which the constituent parts communicate via point-to-point message passing routines. One method for implementing the message passing portion of these codes is with the new Message Passing Interface (MPI) standard. Unfortunately, this standard only specifies the message passing portion of an application, but does not specify any portable mechanisms for loading an application. MPIRUN was developed to provide a portable means for loading MPI programs, and was specifically targeted at multidisciplinary and multi-zonal applications. Programs using MPIRUN for loading and MPI for message passing are then portable between all machines supported by MPIRUN. MPIRUN is currently implemented for the Intel iPSC/860, TMC CM5, IBM SP-1 and SP-2, Intel Paragon, and workstation clusters. Further, MPIRUN is designed to be simple enough to port easily to any system supporting MPI.
Phillips, Jennifer K.
1995-01-01
Two of the current and most popular implementations of the Message-Passing Standard, Message Passing Interface (MPI), were contrasted: MPICH by Argonne National Laboratory, and LAM by the Ohio Supercomputer Center at Ohio State University. A parallel skyline matrix solver was adapted to be run in a heterogeneous environment using MPI. The Message-Passing Interface Forum was held in May 1994 which lead to a specification of library functions that implement the message-passing model of parallel communication. LAM, which creates it's own environment, is more robust in a highly heterogeneous network. MPICH uses the environment native to the machine architecture. While neither of these free-ware implementations provides the performance of native message-passing or vendor's implementations, MPICH begins to approach that performance on the SP-2. The machines used in this study were: IBM RS6000, 3 Sun4, SGI, and the IBM SP-2. Each machine is unique and a few machines required specific modifications during the installation. When installed correctly, both implementations worked well with only minor problems.
Iterated Process Analysis over Lattice-Valued Regular Expressions
Midtgaard, Jan; Nielson, Flemming; Nielson, Hanne Riis
2016-01-01
We present an iterated approach to statically analyze programs of two processes communicating by message passing. Our analysis operates over a domain of lattice-valued regular expressions, and computes increasingly better approximations of each process's communication behavior. Overall the work e...... extends traditional semantics-based program analysis techniques to automatically reason about message passing in a manner that can simultaneously analyze both values of variables as well as message order, message content, and their interdependencies.......We present an iterated approach to statically analyze programs of two processes communicating by message passing. Our analysis operates over a domain of lattice-valued regular expressions, and computes increasingly better approximations of each process's communication behavior. Overall the work...
Towards deductive verification of MPI programs against session types
Eduardo R. B. Marques
2013-12-01
Full Text Available The Message Passing Interface (MPI is the de facto standard message-passing infrastructure for developing parallel applications. Two decades after the first version of the library specification, MPI-based applications are nowadays routinely deployed on super and cluster computers. These applications, written in C or Fortran, exhibit intricate message passing behaviours, making it hard to statically verify important properties such as the absence of deadlocks. Our work builds on session types, a theory for describing protocols that provides for correct-by-construction guarantees in this regard. We annotate MPI primitives and C code with session type contracts, written in the language of a software verifier for C. Annotated code is then checked for correctness with the software verifier. We present preliminary results and discuss the challenges that lie ahead for verifying realistic MPI program compliance against session types.
Merging Belief Propagation and the Mean Field Approximation
Riegler, Erwin; Kirkelund, Gunvor Elisabeth; Manchón, Carles Navarro
2010-01-01
We present a joint message passing approach that combines belief propagation and the mean field approximation. Our analysis is based on the region-based free energy approximation method proposed by Yedidia et al., which allows to use the same objective function (Kullback-Leibler divergence......) as a starting point. In this method message passing fixed point equations (which correspond to the update rules in a message passing algorithm) are then obtained by imposing different region-based approximations and constraints on the mean field and belief propagation parts of the corresponding factor graph....... Our results can be applied, for example, to algorithms that perform joint channel estimation and decoding in iterative receivers. This is demonstrated in a simple example....
Automatic Migration from PARMACS to MPI in Parallel Fortran Applications
Rolf Hempel
1999-01-01
Full Text Available The PARMACS message passing interface has been in widespread use by application projects, especially in Europe. With the new MPI standard for message passing, many projects face the problem of replacing PARMACS with MPI. An automatic translation tool has been developed which replaces all PARMACS 6.0 calls in an application program with their corresponding MPI calls. In this paper we describe the mapping of the PARMACS programming model onto MPI. We then present some implementation details of the converter tool.
Turbo Equalization Using Partial Gaussian Approximation
Zhang, Chuanzong; Wang, Zhongyong; Manchón, Carles Navarro
2016-01-01
This letter deals with turbo equalization for coded data transmission over intersymbol interference (ISI) channels. We propose a message-passing algorithm that uses the expectation propagation rule to convert messages passed from the demodulator and decoder to the equalizer and computes messages...... returned by the equalizer by using a partial Gaussian approximation (PGA). We exploit the specific structure of the ISI channel model to compute the latter messages from the beliefs obtained using a Kalman smoother/equalizer. Doing so leads to a significant complexity reduction compared to the initial PGA...
A new shared-memory programming paradigm for molecular dynamics simulations on the Intel Paragon
D'Azevedo, E.F.; Romine, C.H.
1994-12-01
This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing
Parkhurst, David L.; Kipp, Kenneth L.; Engesgaard, Peter; Charlton, Scott R.
2004-01-01
format suitable for exporting to spreadsheets and post-processing programs; or in Hierarchical Data Format (HDF), which is a compressed binary format. Data in the HDF file can be visualized on Windows computers with the program Model Viewer and extracted with the utility program PHASTHDF; both programs are distributed with PHAST. Operator splitting of the flow, transport, and geochemical equations is used to separate the three processes into three sequential calculations. No iterations between transport and reaction calculations are implemented. A three-dimensional Cartesian coordinate system and finite-difference techniques are used for the spatial and temporal discretization of the flow and transport equations. The non-linear chemical equilibrium equations are solved by a Newton-Raphson method, and the kinetic reaction equations are solved by a Runge-Kutta or an implicit method for integrating ordinary differential equations. The PHAST simulator may require large amounts of memory and long Central Processing Unit (CPU) times. To reduce the long CPU times, a parallel version of PHAST has been developed that runs on a multiprocessor computer or on a collection of computers that are networked. The parallel version requires Message Passing Interface, which is currently (2004) freely available. The parallel version is effective in reducing simulation times. This report documents the use of the PHAST simulator, including running the simulator, preparing the input files, selecting the output files, and visualizing the results. It also presents four examples that verify the numerical method and demonstrate the capabilities of the simulator. PHAST requires three input files. Only the flow and transport file is described in detail in this report. The other two files, the chemistry data file and the database file, are identical to PHREEQC files and the detailed description of these files is found in the PHREEQC documentation.
Motion description for data compression and classification
Lades, M.
1998-01-01
Data compression and processing of image sequences are becoming increasingly important in the era of the information superhighway. This project aims at the development and proof-of-principle of new methods for motion extraction, image sequence compression, and motion analysis. These methods will increase the efficiency of recognition systems and various database applications. The early research into such novel concepts at the forefront of computer vision will benefit LLNL and DOE in all areas associated with archived images and image sequence data. Automated security and surveillance applications are also of special interest in this context. In FY 1997, we started developing a parallel implementation of the face recognition paradigm on the message passing interface (MPI). A parallel implementation is essential to understanding the structure of large image-databases. Our algorithms are now available to interested parties for applications such as scientific data management (SDM). We also are implementing our new algorithms as a growing library of C++ objects. During FY 1997, we focused our research efforts on designing and delivering hardware. In particular, we (1) established the capability to design new retinas and other very-large-scale integrated (VLSI) hardware at LLNL's Institute for Scientific Computing Research (ISCR) and (2) fabricated prototypes through MOSIS, a University of Southern California center for experimental VLSI design. Leveraging our connections to the analog VLSI research community -- particularly connections with the California Institute of Technology, the University of Zurich, and the University of California, San Diego -- we collaborated with Southern Illinois University (SIU) to develop a novel silicon retina with mixed signal processing capabilities. The ITTRACS device delivers the equivalent of 50 x 10 9 operations (50 GOPS) per second. It performs light detection with logarithmic sensitivity, edge extraction, frame differencing, and
Keep Net Working - On a Dependable and Fast Networking Stack
Hruby, T.; Vogt, D.; Bos, H.J.; Tanenbaum, A.S.
2012-01-01
For many years, multiserver 1 operating systems have been demonstrating, by their design, high dependability and reliability. However, the design has inherent performance implications which were not easy to overcome. Until now the context switching and kernel involvement in the message passing was
The Argo NOC: Combining TDM and GALS
Kasapaki, Evangelia; Sparsø, Jens
2015-01-01
Argo is a network-on-chip developed for use in a multi-core platform designed specifically for hard real-time applications and it supports message passing across virtual end-to-end channels. Argo implements these channels using time-division-multiplexing (TDM) of the resources in the NOC following...
LDPC Codes--Structural Analysis and Decoding Techniques
Zhang, Xiaojie
2012-01-01
Low-density parity-check (LDPC) codes have been the focus of much research over the past decade thanks to their near Shannon limit performance and to their efficient message-passing (MP) decoding algorithms. However, the error floor phenomenon observed in MP decoding, which manifests itself as an abrupt change in the slope of the error-rate curve,…
Parallel sparse direct solvers for Poisson's equation in streamer discharges
M. Nool (Margreet); M. Genseberger (Menno); U. M. Ebert (Ute)
2017-01-01
textabstractThe aim of this paper is to examine whether a hybrid approach of parallel computing, a combination of the message passing model (MPI) with the threads model (OpenMP) can deliver good performance in streamer discharge simulations. Since one of the bottlenecks of almost all streamer
Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Hasanov, Khalid; Quintin, Jean-Noë l; Lastovetsky, Alexey
2014-01-01
-scale parallelism in mind. Indeed, while in 1990s a system with few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel
Programming Models for Three-Dimensional Hydrodynamics on the CM-5 (Part II)
Amala, P.A.K.; Rodrigue, G.H.
1994-01-01
This is a two-part presentation of a timing study on the Thinking Machines CORP. CM-5 computer. Part II is given in this study and represents domain-decomposition and message-passing models. Part I described computational problems using a SIMD model and connection machine FORTRAN (CMF)
Experiences implementing the MPI standard on Sandia`s lightweight kernels
Brightwell, R.; Greenberg, D.S.
1997-10-01
This technical report describes some lessons learned from implementing the Message Passing Interface (MPI) standard, and some proposed extentions to MPI, at Sandia. The implementations were developed using Sandia-developed lightweight kernels running on the Intel Paragon and Intel TeraFLOPS platforms. The motivations for this research are discussed, and a detailed analysis of several implementation issues is presented.
Seymour Cray's idea was to build a 'balanced system', that is, a system whose ... operations per second in order to solve problems such as .... This is called a uniform address space and the time to access a .... CEs and managing message routing between CEs. .... which contains user callable routines for message passing,.
Intra and Inter-IOM Ccommunications Summary Document
Ambrosini, G; Cetin, S A; Conka, T; Fernandes, A; Francis, D; Joos, M; Lehmann, G; Mailov, A; Mapelli, L; Mornacchi, Giuseppe; Niculescu, M; Nurdan, K; Petersen, J; Spiwoks, R; Tremblet, L J; Ünel, G
1999-01-01
This document summarises the work performed, within the context of the DAQ-Unit of the DataFlow system in ATLAS DAQ/EF prototype -1 on intra and inter-Input/Output Module (IOM) message passing. This document fulfils the ATLAS DAQ/EF prototype -1 milestone of February 99.
Final report: Compiled MPI. Cost-Effective Exascale Application Development
Gropp, William Douglas [Univ. of Illinois, Urbana-Champaign, IL (United States)
2015-12-21
This is the final report on Compiled MPI: Cost-Effective Exascale Application Development, and summarizes the results under this project. The project investigated runtime enviroments that improve the performance of MPI (Message-Passing Interface) programs; work at Illinois in the last period of this project looked at optimizing data access optimizations expressed with MPI datatypes.
An OCP Compliant Network Adapter for GALS-based SoC Design Using the MANGO Network-on-Chip
Bjerregaard, Tobias; Mahadevan, Shankar; Olsen, Rasmus Grøndahl
2005-01-01
decouples communication and computation, providing memory-mapped OCP transactions based on primitive message-passing services of the network. Also, it facilitates GALS-type systems, by adapting to the clockless network. This helps leverage a modular SoC design flow. We evaluate performance and cost of 0...
S-AMP for non-linear observation models
Cakmak, Burak; Winther, Ole; Fleury, Bernard H.
2015-01-01
Recently we presented the S-AMP approach, an extension of approximate message passing (AMP), to be able to handle general invariant matrix ensembles. In this contribution we extend S-AMP to non-linear observation models. We obtain generalized AMP (GAMP) as the special case when the measurement...
A Down-to-Earth Educational Operating System for Up-in-the-Cloud Many-Core Architectures
Ziwisky, Michael; Persohn, Kyle; Brylow, Dennis
2013-01-01
We present "Xipx," the first port of a major educational operating system to a processor in the emerging class of many-core architectures. Through extensions to the proven Embedded Xinu operating system, Xipx gives students hands-on experience with system programming in a distributed message-passing environment. We expose the software primitives…
pupyMPI - MPI implemented in pure Python
Bromer, Rune; Hantho, Frederik; Vinter, Brian
2011-01-01
As distributed memory systems have become common, the de facto standard for communication is still the Message Passing Interface (MPI). pupyMPI is a pure Python implementation of a broad subset of the MPI 1.3 specifications that allows Python programmers to utilize multiple CPUs with datatypes...
From concatenated codes to graph codes
Justesen, Jørn; Høholdt, Tom
2004-01-01
We consider codes based on simple bipartite expander graphs. These codes may be seen as the first step leading from product type concatenated codes to more complex graph codes. We emphasize constructions of specific codes of realistic lengths, and study the details of decoding by message passing...
Distributed data acquisition for BNL802 II: Software
Watson, W.A. III; LeVine, M.J.
1987-01-01
Multiple processes running on a VAX and on VME-based processors allow up to 16 detector sub-systems to run independently or coupled, with an aggregate throughput of 600 Kbytes/sec. VMS facilities are used extensively for command definition, message passing, controlling access to CAMAC and high voltage modules, and maintaining shared data structures
The Hamlet design entry system: on the implementation of synchronous channels
M.R. van Steen
1994-01-01
textabstractThe graphical Hamlet Application Design Language (ADL) has been developed to support the construction of parallel applications. The language is based on a notion of processes communicating by means of message-passing. One of the goals of ADL is that its implementation should allow
Concurrent determination of connected components
Hesselink, Wim H.; Meijster, Arnold; Bron, Coenraad
2001-01-01
The design is described of a parallel version of Tarjan's algorithm for the determination of equivalence classes in graphs that represent images. Distribution of the vertices of the graph over a number of processes leads to a message passing algorithm. The algorithm is mapped to a shared-memory
MAP Detector for Flash Memory Without Accessing the Interfering Cells
Yassine, Hachem; Badiu, Mihai Alin; Coon, Justin P.
2018-01-01
the latency cost of accessing the interfering cells. Specifically, we exploit the fact that adjacent cells have common interferers by modeling the system as an appropriate hidden Markov model. Then we use the sum-product (message-passing) algorithm to compute the marginal posterior probabilities of the stored...
Parallel object-oriented specification language
Florescu, O.; Voeten, J.P.M.; Theelen, B.D.; Geilen, M.C.W.; Corporaal, H.; Burns, Alan
2008-01-01
The Parallel Object-Oriented Specification Language (POOSL) is an expressive modelling language for hardware/software systems [10]. It was originally defined in [7] as an object-oriented extension of process algebra CCS [6], supporting (conditional) synchronous message passing between
A computational test facility for distributed analysis of gravitational wave signals
Amico, P; Bosi, L; Cattuto, C; Gammaitoni, L; Punturo, M; Travasso, F; Vocca, H
2004-01-01
In the gravitational wave detector Virgo, the in-time detection of a gravitational wave signal from a coalescing binary stellar system is an intensive computational task. A parallel computing scheme using the message passing interface (MPI) is described. Performance results on a small-scale cluster are reported
An introduction to compositional methods for concurrency and their application to real-time
Hooman, J.J.M.; Roever, de W.P.
1992-01-01
Formal methods to specify and verify concurrent programs with synchronous message passing are discussed. We stress the development towards compositional methods, i.e. methods in which the specification of a compound program can be inferred from specifications of its constituents without reference to
A Model for Speedup of Parallel Programs
1997-01-01
Sanjeev. K Setia . The interaction between mem- ory allocation and adaptive partitioning in message- passing multicomputers. In IPPS Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [15] Sanjeev K. Setia and Satish K. Tripathi. A compar- ative analysis of static
Exposing MPI Objects for Debugging
Brock-Nannestad, Laust; DelSignore, John; Squyres, Jeffrey M.
Developers rely on debuggers to inspect application state. In applications that use MPI, the Message Passing Interface, the MPI runtime contains an important part of this state. The MPI Tools Working Group has proposed an interface for MPI Handle Introspection. It allows debuggers and MPI impleme...
MPI Debugging with Handle Introspection
Brock-Nannestad, Laust; DelSignore, John; Squyres, Jeffrey M.
The Message Passing Interface, MPI, is the standard programming model for high performance computing clusters. However, debugging applications on large scale clusters is difficult. The widely used Message Queue Dumping interface enables inspection of message queue state but there is no general in...
Shehzad, Danish; Bozkuş, Zeki
2016-01-01
Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collecti...
Poggioli, Jean Renaud
1984-03-02
This research thesis reports the development of a software for an acquisition computer aimed at the on-line control of nuclear physics experiments. An original architecture, based on the assignment of a processor to each fundamental task, enables the implementation of a high performance system. In order to make the user free of programming constraints, the author developed a software for dynamic generation of acquisition and processing codes. These codes are created from a data base which is programmed by the user by using a language close to the physical reality. Procedures of interactive control of the experiment are thus simplified by displaying function menus on the operator terminal. The author evokes possible hardware improvements and possible extensions of the system [French] Cette these rend compte du developpement logiciel realise pour un calculateur d'acquisition destine au controle en ligne d'experiences de Physique Nucleaire. Une architecture originale, basee sur l'attribution d'un processeur a chacune des taches fondamentales permet de facon simple la mise en oeuvre d'un systeme a hautes performances. Le souci de liberer l'utilisateur des contraintes de programmation a conduit a l'elaboration d'un logiciel de generation dynamique des codes acquisition et traitement; ces derniers sont crees a partir d'une base de donnees que l'experimentateur programme a l'aide d'un langage approchant la realite physique. Les procedures de controle interactif de l'experience se trouvent simplifiees par l'affichage de menus de fonctions sur la console operateur. En conclusion sont evoquees les ameliorations materielles et les extensions possibles du systeme. (auteur)
High-speed parallel implementation of a modified PBR algorithm on DSP-based EH topology
Rajan, K.; Patnaik, L. M.; Ramakrishna, J.
1997-08-01
Algebraic Reconstruction Technique (ART) is an age-old method used for solving the problem of three-dimensional (3-D) reconstruction from projections in electron microscopy and radiology. In medical applications, direct 3-D reconstruction is at the forefront of investigation. The simultaneous iterative reconstruction technique (SIRT) is an ART-type algorithm with the potential of generating in a few iterations tomographic images of a quality comparable to that of convolution backprojection (CBP) methods. Pixel-based reconstruction (PBR) is similar to SIRT reconstruction, and it has been shown that PBR algorithms give better quality pictures compared to those produced by SIRT algorithms. In this work, we propose a few modifications to the PBR algorithms. The modified algorithms are shown to give better quality pictures compared to PBR algorithms. The PBR algorithm and the modified PBR algorithms are highly compute intensive, Not many attempts have been made to reconstruct objects in the true 3-D sense because of the high computational overhead. In this study, we have developed parallel two-dimensional (2-D) and 3-D reconstruction algorithms based on modified PBR. We attempt to solve the two problems encountered by the PBR and modified PBR algorithms, i.e., the long computational time and the large memory requirements, by parallelizing the algorithm on a multiprocessor system. We investigate the possible task and data partitioning schemes by exploiting the potential parallelism in the PBR algorithm subject to minimizing the memory requirement. We have implemented an extended hypercube (EH) architecture for the high-speed execution of the 3-D reconstruction algorithm using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) and dual-port random access memories (DPR) as channels between the PEs. We discuss and compare the performances of the PBR algorithm on an IBM 6000 RISC workstation, on a Silicon
Rt-Space: A Real-Time Stochastically-Provisioned Adaptive Container Environment
2017-08-04
Real-Time Systems (ECRTS) Conference Location: Toulouse, France Paper Title: Multiprocessor Real-Time Locking Protocols for Replicated Resources...Conference Location: Lille, France Paper Title: A Contention-Sensitive Fine-Grained Locking Protocol for Multiprocessor Real-Time Systems Publication...On the Soft Real-Time Optimality of Global EDF on Multiprocessors: From Identical to Uniform Heterogeneous Publication Type: Conference Paper or
Scalable Distributed Architectures for Information Retrieval
Lu, Zhihong
1999-01-01
.... Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection...
The communication processor of TUMULT-64
Smit, Gerardus Johannes Maria; Jansen, P.G.
1988-01-01
Tumult (Twente University MULTi-processor system) is a modular extendible multi-processor system designed and implemented at the Twente University of Technology in co-operation with Oce Nederland B.V. and the Dr. Neher Laboratories (Dutch PTT). Characteristics of the hardware are: MIMD type,
Enabling MPSoC design space exploration on FPGAs
Shabbir, A.; Kumar, A.; Mesman, B.; Corporaal, H.; Hussain, D.M.A.; Rajput, A.Q.K.; Chowdhry, B.S.; Gee, Q.
2009-01-01
Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. These multiprocessors are increasingly becoming heterogeneous for reasons of cost and power. Design space exploration (DSE) of application mapping becomes a major design decision in such systems.
Analyzing composability of applications on MPSoC platforms
Kumar, A.; Mesman, B.; Theelen, B.D.; Corporaal, H.; Yajun, H.
2008-01-01
Modern day applications require use of multi-processor systems for reasons of erformance, scalability and power efficiency. As more and more applications are integrated in a single system, mapping and analyzing them on a multi-processor platform becomes a multidimensional problem. Each possible set