WorldWideScience

Sample records for memory parallel programming

  1. MulticoreBSP for C : A high-performance library for shared-memory parallel programming

    NARCIS (Netherlands)

    Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.

    2014-01-01

    The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the

  2. Introduction to parallel programming

    CERN Document Server

    Brawer, Steven

    1989-01-01

    Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race

  3. The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

    Directory of Open Access Journals (Sweden)

    Ashkan Tousimojarad

    2013-12-01

    Full Text Available We present the Glasgow Parallel Reduction Machine (GPRM, a novel, flexible framework for parallel task-composition based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate writing of parallel programs, in particular non-data parallel algorithms such as reductions.

  4. Parallel External Memory Graph Algorithms

    DEFF Research Database (Denmark)

    Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari

    2010-01-01

    In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one o f the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest...... an optimal speedup of ¿(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts....

  5. PDDP, A Data Parallel Programming Model

    Directory of Open Access Journals (Sweden)

    Karen H. Warren

    1996-01-01

    Full Text Available PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements high-performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

  6. Parallel Programming with Intel Parallel Studio XE

    CERN Document Server

    Blair-Chappell , Stephen

    2012-01-01

    Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the

  7. Parallel programming with Easy Java Simulations

    Science.gov (United States)

    Esquembre, F.; Christian, W.; Belloni, M.

    2018-01-01

    Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a java-based programming environment to treat problems in the usual undergraduate curriculum. We use the easy java simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.

  8. Customizable Memory Schemes for Data Parallel Architectures

    NARCIS (Netherlands)

    Gou, C.

    2011-01-01

    Memory system efficiency is crucial for any processor to achieve high performance, especially in the case of data parallel machines. Processing capabilities of parallel lanes will be wasted, when data requests are not accomplished in a sustainable and timely manner. Irregular vector memory accesses

  9. PSHED: a simplified approach to developing parallel programs

    International Nuclear Information System (INIS)

    Mahajan, S.M.; Ramesh, K.; Rajesh, K.; Somani, A.; Goel, M.

    1992-01-01

    This paper presents a simplified approach in the forms of a tree structured computational model for parallel application programs. An attempt is made to provide a standard user interface to execute programs on BARC Parallel Processing System (BPPS), a scalable distributed memory multiprocessor. The interface package called PSHED provides a basic framework for representing and executing parallel programs on different parallel architectures. The PSHED package incorporates concepts from a broad range of previous research in programming environments and parallel computations. (author). 6 refs

  10. Parallel-Architecture Simulator Development Using Hardware Transactional Memory

    OpenAIRE

    Armejach Sanosa, Adrià

    2009-01-01

    To address the need for a simpler parallel programming model, Transactional Memory (TM) has been developed and promises good parallel performance with easy-to-write parallel code. Unlike lock-based approaches, with TM, programmers do not need to explicitly specify and manage the synchronization among threads. However, programmers simply mark code segments as transactions, and the TM system manages the concurrency control for them. TM can be implemented either in software (STM) or hardware (HT...

  11. Speedup predictions on large scientific parallel programs

    International Nuclear Information System (INIS)

    Williams, E.; Bobrowicz, F.

    1985-01-01

    How much speedup can we expect for large scientific parallel programs running on supercomputers. For insight into this problem we extend the parallel processing environment currently existing on the Cray X-MP (a shared memory multiprocessor with at most four processors) to a simulated N-processor environment, where N greater than or equal to 1. Several large scientific parallel programs from Los Alamos National Laboratory were run in this simulated environment, and speedups were predicted. A speedup of 14.4 on 16 processors was measured for one of the three most used codes at the Laboratory

  12. Portable parallel programming in a Fortran environment

    International Nuclear Information System (INIS)

    May, E.N.

    1989-01-01

    Experience using the Argonne-developed PARMACs macro package to implement a portable parallel programming environment is described. Fortran programs with intrinsic parallelism of coarse and medium granularity are easily converted to parallel programs which are portable among a number of commercially available parallel processors in the class of shared-memory bus-based and local-memory network based MIMD processors. The parallelism is implemented using standard UNIX (tm) tools and a small number of easily understood synchronization concepts (monitors and message-passing techniques) to construct and coordinate multiple cooperating processes on one or many processors. Benchmark results are presented for parallel computers such as the Alliant FX/8, the Encore MultiMax, the Sequent Balance, the Intel iPSC/2 Hypercube and a network of Sun 3 workstations. These parallel machines are typical MIMD types with from 8 to 30 processors, each rated at from 1 to 10 MIPS processing power. The demonstration code used for this work is a Monte Carlo simulation of the response to photons of a ''nearly realistic'' lead, iron and plastic electromagnetic and hadronic calorimeter, using the EGS4 code system. 6 refs., 2 figs., 2 tabs

  13. Programming massively parallel processors a hands-on approach

    CERN Document Server

    Kirk, David B

    2010-01-01

    Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. ""Massively parallel"" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVI- DIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...

  14. Parallelization for first principles electronic state calculation program

    International Nuclear Information System (INIS)

    Watanabe, Hiroshi; Oguchi, Tamio.

    1997-03-01

    In this report we study the parallelization for First principles electronic state calculation program. The target machines are NEC SX-4 for shared memory type parallelization and FUJITSU VPP300 for distributed memory type parallelization. The features of each parallel machine are surveyed, and the parallelization methods suitable for each are proposed. It is shown that 1.60 times acceleration is achieved with 2 CPU parallelization by SX-4 and 4.97 times acceleration is achieved with 12 PE parallelization by VPP 300. (author)

  15. Parallel programming with Python

    CERN Document Server

    Palach, Jan

    2014-01-01

    A fast, easy-to-follow and clear tutorial to help you develop Parallel computing systems using Python. Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts and will help you in implementing these techniques in the real world. If you are an experienced Python programmer and are willing to utilize the available computing resources by parallelizing applications in a simple way, then this book is for you. You are required to have a basic knowledge of Python development to get the most of this book.

  16. Implementing Shared Memory Parallelism in MCBEND

    Directory of Open Access Journals (Sweden)

    Bird Adam

    2017-01-01

    Full Text Available MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers’s ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.

  17. Practical parallel programming

    CERN Document Server

    Bauer, Barr E

    2014-01-01

    This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.

  18. A Tutorial on Parallel and Concurrent Programming in Haskell

    Science.gov (United States)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.

  19. Writing parallel programs that work

    CERN Multimedia

    CERN. Geneva

    2012-01-01

    Serial algorithms typically run inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial. This talk will describe the techniques used in the Intel Parallel Studio to provide a developer with the tools necessary to understand the behaviors and limitations of the existing serial programs. Once the limitations are known the developer can refactor the algorithms and reanalyze the resulting programs with the tools in the Intel Parallel Studio to create parallel programs that work. About the speaker Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on auto-parallelizing compiler (KAP), and was involved in th...

  20. The FORCE: A highly portable parallel programming language

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory microprocessors, and how a two-level macro preprocessor makes it possible to hide low level machine dependencies and to build machine-independent high level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared memory multiprocessor executing them.

  1. The FORCE - A highly portable parallel programming language

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  2. Programming parallel architectures - The BLAZE family of languages

    Science.gov (United States)

    Mehrotra, Piyush

    1989-01-01

    This paper gives an overview of the various approaches to programming multiprocessor architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive, since they remove much of the burden of exploiting parallel architectures from the user. This paper also describes recent work in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described.

  3. Pattern recognition with parallel associative memory

    Science.gov (United States)

    Toth, Charles K.; Schenk, Toni

    1990-01-01

    An examination is conducted of the feasibility of searching targets in aerial photographs by means of a parallel associative memory (PAM) that is based on the nearest-neighbor algorithm; the Hamming distance is used as a measure of closeness, in order to discriminate patterns. Attention has been given to targets typically used for ground-control points. The method developed sorts out approximate target positions where precise localizations are needed, in the course of the data-acquisition process. The majority of control points in different images were correctly identified.

  4. Distributed Memory Parallel Computing with SEAWAT

    Science.gov (United States)

    Verkaik, J.; Huizer, S.; van Engelen, J.; Oude Essink, G.; Ram, R.; Vuik, K.

    2017-12-01

    Fresh groundwater reserves in coastal aquifers are threatened by sea-level rise, extreme weather conditions, increasing urbanization and associated groundwater extraction rates. To counteract these threats, accurate high-resolution numerical models are required to optimize the management of these precious reserves. The major model drawbacks are long run times and large memory requirements, limiting the predictive power of these models. Distributed memory parallel computing is an efficient technique for reducing run times and memory requirements, where the problem is divided over multiple processor cores. A new Parallel Krylov Solver (PKS) for SEAWAT is presented. PKS has recently been applied to MODFLOW and includes Conjugate Gradient (CG) and Biconjugate Gradient Stabilized (BiCGSTAB) linear accelerators. Both accelerators are preconditioned by an overlapping additive Schwarz preconditioner in a way that: a) subdomains are partitioned using Recursive Coordinate Bisection (RCB) load balancing, b) each subdomain uses local memory only and communicates with other subdomains by Message Passing Interface (MPI) within the linear accelerator, c) it is fully integrated in SEAWAT. Within SEAWAT, the PKS-CG solver replaces the Preconditioned Conjugate Gradient (PCG) solver for solving the variable-density groundwater flow equation and the PKS-BiCGSTAB solver replaces the Generalized Conjugate Gradient (GCG) solver for solving the advection-diffusion equation. PKS supports the third-order Total Variation Diminishing (TVD) scheme for computing advection. Benchmarks were performed on the Dutch national supercomputer (https://userinfo.surfsara.nl/systems/cartesius) using up to 128 cores, for a synthetic 3D Henry model (100 million cells) and the real-life Sand Engine model ( 10 million cells). The Sand Engine model was used to investigate the potential effect of the long-term morphological evolution of a large sand replenishment and climate change on fresh groundwater resources

  5. Parallel discrete event simulation using shared memory

    Science.gov (United States)

    Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

    1988-01-01

    With traditional event-list techniques, evaluating a detailed discrete-event simulation-model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the numbers of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm, to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential-simulation of most queueing network models.

  6. Parallel phase model : a programming model for high-end parallel machines with manycores.

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Junfeng (Syracuse University, Syracuse, NY); Wen, Zhaofang; Heroux, Michael Allen; Brightwell, Ronald Brian

    2009-04-01

    This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.

  7. Refinement of Parallel and Reactive Programs

    OpenAIRE

    Back, R. J. R.

    1992-01-01

    We show how to apply the refinement calculus to stepwise refinement of parallel and reactive programs. We use action systems as our basic program model. Action systems are sequential programs which can be implemented in a parallel fashion. Hence refinement calculus methods, originally developed for sequential programs, carry over to the derivation of parallel programs. Refinement of reactive programs is handled by data refinement techniques originally developed for the sequential refinement c...

  8. About Parallel Programming: Paradigms, Parallel Execution and Collaborative Systems

    Directory of Open Access Journals (Sweden)

    Loredana MOCEAN

    2009-01-01

    Full Text Available In the last years, there were made efforts for delineation of a stabile and unitary frame, where the problems of logical parallel processing must find solutions at least at the level of imperative languages. The results obtained by now are not at the level of the made efforts. This paper wants to be a little contribution at these efforts. We propose an overview in parallel programming, parallel execution and collaborative systems.

  9. Programming parallel architectures: The BLAZE family of languages

    Science.gov (United States)

    Mehrotra, Piyush

    1988-01-01

    Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.

  10. PRISMA/DB: A Parallel Main-Memory Relational DBMS

    NARCIS (Netherlands)

    Apers, Peter M.G.; Flokstra, Jan; van den Berg, Carel A.; Grefen, P.W.P.J.; Wilschut, A.N.; Kersten, Martin L.; van den Berg, C.A.

    1992-01-01

    PRISMA/DB, a full-fledged parallel, main memory relational database management system (DBMS) is described. PRISMA/DB's high performance is obtained by the use of parallelism for query processing and main memory storage of the entire database. A flexible architecture for experimenting with

  11. Structured Parallel Programming Patterns for Efficient Computation

    CERN Document Server

    McCool, Michael; Robison, Arch

    2012-01-01

    Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of th

  12. Compiling Scientific Programs for Scalable Parallel Systems

    National Research Council Canada - National Science Library

    Kennedy, Ken

    2001-01-01

    ...). The research performed in this project included new techniques for recognizing implicit parallelism in sequential programs, a powerful and precise set-based framework for analysis and transformation...

  13. Shared memory parallelism for 3D cartesian discrete ordinates solver

    International Nuclear Information System (INIS)

    Moustafa, S.; Dutka-Malen, I.; Plagne, L.; Poncot, A.; Ramet, P.

    2013-01-01

    This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multi-core + SIMD - Single Instruction on Multiple Data) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46*10 6 spatial cells and 1*10 12 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool. (authors)

  14. Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

    Directory of Open Access Journals (Sweden)

    Stephen L. Olivier

    2013-01-01

    Full Text Available Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.

  15. A Programming Model for Massive Data Parallelism with Data Dependencies

    International Nuclear Information System (INIS)

    Cui, Xiaohui; Mueller, Frank; Potok, Thomas E.; Zhang, Yongpeng

    2009-01-01

    Accelerating processors can often be more cost and energy effective for a wide range of data-parallel computing problems than general-purpose processors. For graphics processor units (GPUs), this is particularly the case when program development is aided by environments such as NVIDIA s Compute Unified Device Architecture (CUDA), which dramatically reduces the gap between domain-specific architectures and general purpose programming. Nonetheless, general-purpose GPU (GPGPU) programming remains subject to several restrictions. Most significantly, the separation of host (CPU) and accelerator (GPU) address spaces requires explicit management of GPU memory resources, especially for massive data parallelism that well exceeds the memory capacity of GPUs. One solution to this problem is to transfer data between the GPU and host memories frequently. In this work, we investigate another approach. We run massively data-parallel applications on GPU clusters. We further propose a programming model for massive data parallelism with data dependencies for this scenario. Experience from micro benchmarks and real-world applications shows that our model provides not only ease of programming but also significant performance gains

  16. Massively Parallel Finite Element Programming

    KAUST Repository

    Heister, Timo

    2010-01-01

    Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.

  17. Massively Parallel Finite Element Programming

    KAUST Repository

    Heister, Timo; Kronbichler, Martin; Bangerth, Wolfgang

    2010-01-01

    Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.

  18. Language constructs for modular parallel programs

    Energy Technology Data Exchange (ETDEWEB)

    Foster, I.

    1996-03-01

    We describe programming language constructs that facilitate the application of modular design techniques in parallel programming. These constructs allow us to isolate resource management and processor scheduling decisions from the specification of individual modules, which can themselves encapsulate design decisions concerned with concurrence, communication, process mapping, and data distribution. This approach permits development of libraries of reusable parallel program components and the reuse of these components in different contexts. In particular, alternative mapping strategies can be explored without modifying other aspects of program logic. We describe how these constructs are incorporated in two practical parallel programming languages, PCN and Fortran M. Compilers have been developed for both languages, allowing experimentation in substantial applications.

  19. Experiences in Data-Parallel Programming

    Directory of Open Access Journals (Sweden)

    Terry W. Clark

    1997-01-01

    Full Text Available To efficiently parallelize a scientific application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization effort of ours reinforced this observation and motivated this correspondence. Specifically, we have transformed a Fortran 77 version of GROMOS, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation we have encountered a number of difficulties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. Our experience with GROMOS suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. This note presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or High Performance Fortran compilers.

  20. Parallel structures in human and computer memory

    Science.gov (United States)

    Kanerva, Pentti

    1986-08-01

    If we think of our experiences as being recorded continuously on film, then human memory can be compared to a film library that is indexed by the contents of the film strips stored in it. Moreover, approximate retrieval cues suffice to retrieve information stored in this library: We recognize a familiar person in a fuzzy photograph or a familiar tune played on a strange instrument. This paper is about how to construct a computer memory that would allow a computer to recognize patterns and to recall sequences the way humans do. Such a memory is remarkably similar in structure to a conventional computer memory and also to the neural circuits in the cortex of the cerebellum of the human brain. The paper concludes that the frame problem of artificial intelligence could be solved by the use of such a memory if we were able to encode information about the world properly.

  1. Mapping robust parallel multigrid algorithms to scalable memory architectures

    Science.gov (United States)

    Overman, Andrea; Vanrosendale, John

    1993-01-01

    The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.

  2. Productive Parallel Programming: The PCN Approach

    Directory of Open Access Journals (Sweden)

    Ian Foster

    1992-01-01

    Full Text Available We describe the PCN programming system, focusing on those features designed to improve the productivity of scientists and engineers using parallel supercomputers. These features include a simple notation for the concise specification of concurrent algorithms, the ability to incorporate existing Fortran and C code into parallel applications, facilities for reusing parallel program components, a portable toolkit that allows applications to be developed on a workstation or small parallel computer and run unchanged on supercomputers, and integrated debugging and performance analysis tools. We survey representative scientific applications and identify problem classes for which PCN has proved particularly useful.

  3. Parallelization for X-ray crystal structural analysis program

    Energy Technology Data Exchange (ETDEWEB)

    Watanabe, Hiroshi [Japan Atomic Energy Research Inst., Tokyo (Japan); Minami, Masayuki; Yamamoto, Akiji

    1997-10-01

    In this report we study vectorization and parallelization for X-ray crystal structural analysis program. The target machine is NEC SX-4 which is a distributed/shared memory type vector parallel supercomputer. X-ray crystal structural analysis is surveyed, and a new multi-dimensional discrete Fourier transform method is proposed. The new method is designed to have a very long vector length, so that it enables to obtain the 12.0 times higher performance result that the original code. Besides the above-mentioned vectorization, the parallelization by micro-task functions on SX-4 reaches 13.7 times acceleration in the part of multi-dimensional discrete Fourier transform with 14 CPUs, and 3.0 times acceleration in the whole program. Totally 35.9 times acceleration to the original 1CPU scalar version is achieved with vectorization and parallelization on SX-4. (author)

  4. Parallel adaptation of a vectorised quantumchemical program system

    International Nuclear Information System (INIS)

    Van Corler, L.C.H.; Van Lenthe, J.H.

    1987-01-01

    Supercomputers, like the CRAY 1 or the Cyber 205, have had, and still have, a marked influence on Quantum Chemistry. Vectorization has led to a considerable increase in the performance of Quantum Chemistry programs. However, clockcycle times more than a factor 10 smaller than those of the present supercomputers are not to be expected. Therefore future supercomputers will have to depend on parallel structures. Recently, the first examples of such supercomputers have been installed. To be prepared for this new generation of (parallel) supercomputers one should consider the concepts one wants to use and the kind of problems one will encounter during implementation of existing vectorized programs on those parallel systems. The authors implemented four important parts of a large quantumchemical program system (ATMOL), i.e. integrals, SCF, 4-index and Direct-CI in the parallel environment at ECSEC (Rome, Italy). This system offers simulated parallellism on the host computer (IBM 4381) and real parallellism on at most 10 attached processors (FPS-164). Quantumchemical programs usually handle large amounts of data and very large, often sparse matrices. The transfer of that many data can cause problems concerning communication and overhead, in view of which shared memory and shared disks must be considered. The strategy and the tools that were used to parallellise the programs are shown. Also, some examples are presented to illustrate effectiveness and performance of the system in Rome for these type of calculations

  5. Parallel processor programs in the Federal Government

    Science.gov (United States)

    Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

    1985-01-01

    In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

  6. A compositional reservoir simulator on distributed memory parallel computers

    International Nuclear Information System (INIS)

    Rame, M.; Delshad, M.

    1995-01-01

    This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. A portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented

  7. Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns

    Directory of Open Access Journals (Sweden)

    Xiaodong Zhang

    1999-01-01

    Full Text Available Shared‐memory and data‐parallel programming models are two important paradigms for scientific applications. Both models provide high‐level program abstractions, and simple and uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlining execution and overhead patterns are significantly different between the two models due to their programming constraints, and due to different and complex structures of interconnection networks and systems which support the two models. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM and a linear system solver using the shared‐memory model on the KSR‐1 and the data‐parallel model on the CM‐5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation‐intensive in the KSR‐1 shared‐memory system, and memory‐demanding in the CM‐5 data‐parallel system when the systems and the problems are scaled. The EM program, a highly data‐parallel program performed extremely well, and the linear system solver, a highly control‐structured program suffered significantly in the data‐parallel model on the CM‐5. Our study provides further evidence that matching execution patterns of algorithms to parallel architectures would achieve better performance.

  8. Enhanced memory architecture for massively parallel vision chip

    Science.gov (United States)

    Chen, Zhe; Yang, Jie; Liu, Liyuan; Wu, Nanjian

    2015-04-01

    Local memory architecture plays an important role in high performance massively parallel vision chip. In this paper, we propose an enhanced memory architecture with compact circuit area designed in a full-custom flow. The memory consists of separate master-stage static latches and shared slave-stage dynamic latches. We use split transmission transistors on the input data path to enhance tolerance for charge sharing and to achieve random read/write capabilities. The memory is designed in a 0.18 μm CMOS process. The area overhead of the memory achieves 16.6 μm2/bit. Simulation results show that the maximum operating frequency reaches 410 MHz and the corresponding peak dynamic power consumption for a 64-bit memory unit is 190 μW under 1.8 V supply voltage.

  9. The kpx, a program analyzer for parallelization

    International Nuclear Information System (INIS)

    Matsuyama, Yuji; Orii, Shigeo; Ota, Toshiro; Kume, Etsuo; Aikawa, Hiroshi.

    1997-03-01

    The kpx is a program analyzer, developed as a common technological basis for promoting parallel processing. The kpx consists of three tools. The first is ktool, that shows how much execution time is spent in program segments. The second is ptool, that shows parallelization overhead on the Paragon system. The last is xtool, that shows parallelization overhead on the VPP system. The kpx, designed to work for any FORTRAN cord on any UNIX computer, is confirmed to work well after testing on Paragon, SP2, SR2201, VPP500, VPP300, Monte-4, SX-4 and T90. (author)

  10. Automatic Parallelization Tool: Classification of Program Code for Parallel Computing

    Directory of Open Access Journals (Sweden)

    Mustafa Basthikodi

    2016-04-01

    Full Text Available Performance growth of single-core processors has come to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks along with Graphical Processing Units empowered to enhance parallelism broadly. Couples of compilers are updated to developing challenges forsynchronization and threading issues. Appropriate program and algorithm classifications will have advantage to a great extent to the group of software engineers to get opportunities for effective parallelization. In present work we investigated current species for classification of algorithms, in that related work on classification is discussed along with the comparison of issues that challenges the classification. The set of algorithms are chosen which matches the structure with different issues and perform given task. We have tested these algorithms utilizing existing automatic species extraction toolsalong with Bones compiler. We have added functionalities to existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user defined types, constants and mathematical functions. With this, we can retain significant data which is not captured by original speciesof algorithms. We executed new theories into the device, empowering automatic characterization of program code.

  11. Dynamic overset grid communication on distributed memory parallel processors

    Science.gov (United States)

    Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.

    1993-01-01

    A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.

  12. From sequential to parallel programming with patterns

    CERN Document Server

    CERN. Geneva

    2018-01-01

    To increase in both performance and efficiency, our programming models need to adapt to better exploit modern processors. The classic idioms and patterns for programming such as loops, branches or recursion are the pillars of almost every code and are well known among all programmers. These patterns all have in common that they are sequential in nature. Embracing parallel programming patterns, which allow us to program for multi- and many-core hardware in a natural way, greatly simplifies the task of designing a program that scales and performs on modern hardware, independently of the used programming language, and in a generic way.

  13. Testing New Programming Paradigms with NAS Parallel Benchmarks

    Science.gov (United States)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage

  14. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    Science.gov (United States)

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  15. Parallel pathways for cross-modal memory retrieval in Drosophila.

    Science.gov (United States)

    Zhang, Xiaonan; Ren, Qingzhong; Guo, Aike

    2013-05-15

    Memory-retrieval processing of cross-modal sensory preconditioning is vital for understanding the plasticity underlying the interactions between modalities. As part of the sensory preconditioning paradigm, it has been hypothesized that the conditioned response to an unreinforced cue depends on the memory of the reinforced cue via a sensory link between the two cues. To test this hypothesis, we studied cross-modal memory-retrieval processing in a genetically tractable model organism, Drosophila melanogaster. By expressing the dominant temperature-sensitive shibire(ts1) (shi(ts1)) transgene, which blocks synaptic vesicle recycling of specific neural subsets with the Gal4/UAS system at the restrictive temperature, we specifically blocked visual and olfactory memory retrieval, either alone or in combination; memory acquisition remained intact for these modalities. Blocking the memory retrieval of the reinforced olfactory cues did not impair the conditioned response to the unreinforced visual cues or vice versa, in contrast to the canonical memory-retrieval processing of sensory preconditioning. In addition, these conditioned responses can be abolished by blocking the memory retrieval of the two modalities simultaneously. In sum, our results indicated that a conditioned response to an unreinforced cue in cross-modal sensory preconditioning can be recalled through parallel pathways.

  16. On-line event reconstruction using a parallel in-memory data base

    OpenAIRE

    Argante, E; Van der Stok, P D V; Willers, Ian Malcolm

    1995-01-01

    PORS is a system designed for on-line event reconstruction in high energy physics (HEP) experiments. It uses the CPREAD reconstruction program. Central to the system is a parallel in-memory database which is used as communication medium between parallel workers. A farming control structure is implemented with PORS in a natural way. The database provides structured storage of data with a short life time. PORS serves as a case study for the construction of a methodology on how to apply parallel...

  17. A Parallel Saturation Algorithm on Shared Memory Architectures

    Science.gov (United States)

    Ezekiel, Jonathan; Siminiceanu

    2007-01-01

    Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.

  18. Adaptive Dynamic Process Scheduling on Distributed Memory Parallel Computers

    Directory of Open Access Journals (Sweden)

    Wei Shu

    1994-01-01

    Full Text Available One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The adaptive contracting within neighborhood (ACWN is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adaptiveness. Its feature of quickly spreading the work helps it outperform the gradient model in performance and scalability.

  19. Parallel Volunteer Learning during Youth Programs

    Science.gov (United States)

    Lesmeister, Marilyn K.; Green, Jeremy; Derby, Amy; Bothum, Candi

    2012-01-01

    Lack of time is a hindrance for volunteers to participate in educational opportunities, yet volunteer success in an organization is tied to the orientation and education they receive. Meeting diverse educational needs of volunteers can be a challenge for program managers. Scheduling a Volunteer Learning Track for chaperones that is parallel to a…

  20. Contributions to computational stereology and parallel programming

    DEFF Research Database (Denmark)

    Rasmusson, Allan

    rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has lead to a significant change in paradigm for a previously developed surgical...

  1. Parallel Breadth-First Search on Distributed Memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Computational Research Division; Buluc, Aydin; Madduri, Kamesh

    2011-04-15

    Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned par- allel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix- partitioning-based approach that mitigates parallel commu- nication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex based approach. Our experimental study identifies execu- tion regimes in which these approaches will be competitive, and we demonstrate extremely high performance on lead- ing distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny- Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.

  2. Program For Parallel Discrete-Event Simulation

    Science.gov (United States)

    Beckman, Brian C.; Blume, Leo R.; Geiselman, John S.; Presley, Matthew T.; Wedel, John J., Jr.; Bellenot, Steven F.; Diloreto, Michael; Hontalas, Philip J.; Reiher, Peter L.; Weiland, Frederick P.

    1991-01-01

    User does not have to add any special logic to aid in synchronization. Time Warp Operating System (TWOS) computer program is special-purpose operating system designed to support parallel discrete-event simulation. Complete implementation of Time Warp mechanism. Supports only simulations and other computations designed for virtual time. Time Warp Simulator (TWSIM) subdirectory contains sequential simulation engine interface-compatible with TWOS. TWOS and TWSIM written in, and support simulations in, C programming language.

  3. Monte Carlo photon transport on shared memory and distributed memory parallel processors

    International Nuclear Information System (INIS)

    Martin, W.R.; Wan, T.C.; Abdel-Rahman, T.S.; Mudge, T.N.; Miura, K.

    1987-01-01

    Parallelized Monte Carlo algorithms for analyzing photon transport in an inertially confined fusion (ICF) plasma are considered. Algorithms were developed for shared memory (vector and scalar) and distributed memory (scalar) parallel processors. The shared memory algorithm was implemented on the IBM 3090/400, and timing results are presented for dedicated runs with two, three, and four processors. Two alternative distributed memory algorithms (replication and dispatching) were implemented on a hypercube parallel processor (1 through 64 nodes). The replication algorithm yields essentially full efficiency for all cube sizes; with the 64-node configuration, the absolute performance is nearly the same as with the CRAY X-MP. The dispatching algorithm also yields efficiencies above 80% in a large simulation for the 64-processor configuration

  4. Parallel implementation of the PHOENIX generalized stellar atmosphere program. II. Wavelength parallelization

    International Nuclear Information System (INIS)

    Baron, E.; Hauschildt, Peter H.

    1998-01-01

    We describe an important addition to the parallel implementation of our generalized nonlocal thermodynamic equilibrium (NLTE) stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, that is, distributing the radial zones, individual spectral lines, or characteristic rays among different processors and employ, in addition, task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000 - 300,000) and hence parallelization over wavelength can lead both to considerable speedup in calculation time and the ability to make use of the aggregate memory available on massively parallel supercomputers. Here, we describe an implementation of a pipelined design for the wavelength parallelization of PHOENIX, where the necessary data from the processor working on a previous wavelength point is sent to the processor working on the succeeding wavelength point as soon as it is known. Our implementation uses a MIMD design based on a relatively small number of standard message passing interface (MPI) library calls and is fully portable between serial and parallel computers. copyright 1998 The American Astronomical Society

  5. A portable implementation of ARPACK for distributed memory parallel architectures

    Energy Technology Data Exchange (ETDEWEB)

    Maschhoff, K.J.; Sorensen, D.C.

    1996-12-31

    ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and Message Passing Interface(MPI).

  6. Parallel discrete ordinates algorithms on distributed and common memory systems

    International Nuclear Information System (INIS)

    Wienke, B.R.; Hiromoto, R.E.; Brickner, R.G.

    1987-01-01

    The S/sub n/ algorithm employs iterative techniques in solving the linear Boltzmann equation. These methods, both ordered and chaotic, were compared on both the Denelcor HEP and the Intel hypercube. Strategies are linked to the organization and accessibility of memory (common memory versus distributed memory architectures), with common concern for acquisition of global information. Apart from this, the inherent parallelism of the algorithm maps directly onto the two architectures. Results comparing execution times, speedup, and efficiency are based on a representative 16-group (full upscatter and downscatter) sample problem. Calculations were performed on both the Los Alamos National Laboratory (LANL) Denelcor HEP and the LANL Intel hypercube. The Denelcor HEP is a 64-bit multi-instruction, multidate MIMD machine consisting of up to 16 process execution modules (PEMs), each capable of executing 64 processes concurrently. Each PEM can cooperate on a job, or run several unrelated jobs, and share a common global memory through a crossbar switch. The Intel hypercube, on the other hand, is a distributed memory system composed of 128 processing elements, each with its own local memory. Processing elements are connected in a nearest-neighbor hypercube configuration and sharing of data among processors requires execution of explicit message-passing constructs

  7. The specificity of learned parallelism in dual-memory retrieval.

    Science.gov (United States)

    Strobach, Tilo; Schubert, Torsten; Pashler, Harold; Rickard, Timothy

    2014-05-01

    Retrieval of two responses from one visually presented cue occurs sequentially at the outset of dual-retrieval practice. Exclusively for subjects who adopt a mode of grouping (i.e., synchronizing) their response execution, however, reaction times after dual-retrieval practice indicate a shift to learned retrieval parallelism (e.g., Nino & Rickard, in Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 373-388, 2003). In the present study, we investigated how this learned parallelism is achieved and why it appears to occur only for subjects who group their responses. Two main accounts were considered: a task-level versus a cue-level account. The task-level account assumes that learned retrieval parallelism occurs at the level of the task as a whole and is not limited to practiced cues. Grouping response execution may thus promote a general shift to parallel retrieval following practice. The cue-level account states that learned retrieval parallelism is specific to practiced cues. This type of parallelism may result from cue-specific response chunking that occurs uniquely as a consequence of grouped response execution. The results of two experiments favored the second account and were best interpreted in terms of a structural bottleneck model.

  8. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    Science.gov (United States)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.

  9. Emotional stimuli exert parallel effects on attention and memory.

    Science.gov (United States)

    Talmi, Deborah; Ziegler, Marilyne; Hawksworth, Jade; Lalani, Safina; Herman, C Peter; Moscovitch, Morris

    2013-01-01

    Because emotional and neutral stimuli typically differ on non-emotional dimensions, it has been difficult to determine conclusively which factors underlie the ability of emotional stimuli to enhance immediate long-term memory. Here we induced arousal by varying participants' goals, a method that removes many potential confounds between emotional and non-emotional items. Hungry and sated participants encoded food and clothing images under divided attention conditions. Sated participants attended to and recalled food and clothing images equivalently. Hungry participants performed worse on the concurrent tone-discrimination task when they viewed food relative to clothing images, suggesting enhanced attention to food images, and they recalled more food than clothing images. A follow-up regression analysis of the factors predicting memory for individual pictures revealed that food images had parallel effects on attention and memory in hungry participants, so that enhanced attention to food images did not predict their enhanced memory. We suggest that immediate long-term memory for food is enhanced in the hungry state because hunger leads to more distinctive processing of food images rendering them more accessible during retrieval.

  10. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    Energy Technology Data Exchange (ETDEWEB)

    Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.

  11. Parallel SN algorithms in shared- and distributed-memory environments

    International Nuclear Information System (INIS)

    Haghighat, Alireza; Hunter, Melissa A.; Mattis, Ronald E.

    1995-01-01

    Different 2-D spatial domain partitioning Sn transport theory algorithms have been developed on the basis of the Block-Jacobi iterative scheme. These algorithms have been incorporated into TWOTRAN-II, and tested on a shared-memory CRAY Y-MP C90 and a distributed-memory IBM SP1. For a series of fixed source r-z geometry homogeneous problems, parallel efficiencies in a range of 50-90% are achieved on the C90 with 6 processors, and lower values (20-60%) are obtained on the SP1. It is demonstrated that better performance is attainable if one addresses issues such as convergence rate, load-balancing, and granularity for both architectures, as well as message passing (network bandwidth and latency) for SP1. (author). 17 refs, 4 figs

  12. Paging memory from random access memory to backing storage in a parallel computer

    Science.gov (United States)

    Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E

    2013-05-21

    Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.

  13. Parallel discrete event simulation: A shared memory approach

    Science.gov (United States)

    Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

    1987-01-01

    With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

  14. Parallel k-means++ for Multiple Shared-Memory Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.

  15. Particle simulation on a distributed memory highly parallel processor

    International Nuclear Information System (INIS)

    Sato, Hiroyuki; Ikesaka, Morio

    1990-01-01

    This paper describes parallel molecular dynamics simulation of atoms governed by local force interaction. The space in the model is divided into cubic subspaces and mapped to the processor array of the CAP-256, a distributed memory, highly parallel processor developed at Fujitsu Labs. We developed a new technique to avoid redundant calculation of forces between atoms in different processors. Experiments showed the communication overhead was less than 5%, and the idle time due to load imbalance was less than 11% for two model problems which contain 11,532 and 46,128 argon atoms. From the software simulation, the CAP-II which is under development is estimated to be about 45 times faster than CAP-256 and will be able to run the same problem about 40 times faster than Fujitsu's M-380 mainframe when 256 processors are used. (author)

  16. Evolution of a minimal parallel programming model

    International Nuclear Information System (INIS)

    Lusk, Ewing; Butler, Ralph; Pieper, Steven C.

    2017-01-01

    Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generality and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.

  17. Parallelization and checkpointing of GPU applications through program transformation

    Energy Technology Data Exchange (ETDEWEB)

    Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that makes writing general-purpose applications for running on GPUs tractable have consolidated GPUs as an alternative for accelerating general purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the High Performance Computing (HPC) Industry. In order to continue to exploit higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running in multi-GPU systems. Furthermore, multi-GPU systems help to solve the GPU memory limitation for applications with large application memory footprint. Parallelizing single-GPU applications has been approached by libraries that distribute the workload at runtime, however, they impose execution overhead and are not portable. On the other hand, on traditional CPU systems, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at application level and does not have the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine of today, reliability is also a concern in GPUs. GPUs are vulnerable to transient and permanent failures. Current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems present new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism and

  18. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    International Nuclear Information System (INIS)

    Nash, T.

    1989-05-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. 6 figs

  19. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    International Nuclear Information System (INIS)

    Nash, T.

    1989-01-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC systems, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described. (orig.)

  20. Event parallelism: Distributed memory parallel computing for high energy physics experiments

    Science.gov (United States)

    Nash, Thomas

    1989-12-01

    This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.

  1. The parallel processing of EGS4 code on distributed memory scalar parallel computer:Intel Paragon XP/S15-256

    Energy Technology Data Exchange (ETDEWEB)

    Takemiya, Hiroshi; Ohta, Hirofumi; Honma, Ichirou

    1996-03-01

    The parallelization of Electro-Magnetic Cascade Monte Carlo Simulation Code, EGS4 on distributed memory scalar parallel computer: Intel Paragon XP/S15-256 is described. EGS4 has the feature that calculation time for one incident particle is quite different from each other because of the dynamic generation of secondary particles and different behavior of each particle. Granularity for parallel processing, parallel programming model and the algorithm of parallel random number generation are discussed and two kinds of method, each of which allocates particles dynamically or statically, are used for the purpose of realizing high speed parallel processing of this code. Among four problems chosen for performance evaluation, the speedup factors for three problems have been attained to nearly 100 times with 128 processor. It has been found that when both the calculation time for each incident particles and its dispersion are large, it is preferable to use dynamic particle allocation method which can average the load for each processor. And it has also been found that when they are small, it is preferable to use static particle allocation method which reduces the communication overhead. Moreover, it is pointed out that to get the result accurately, it is necessary to use double precision variables in EGS4 code. Finally, the workflow of program parallelization is analyzed and tools for program parallelization through the experience of the EGS4 parallelization are discussed. (author).

  2. Development of parallel/serial program analyzing tool

    International Nuclear Information System (INIS)

    Watanabe, Hiroshi; Nagao, Saichi; Takigawa, Yoshio; Kumakura, Toshimasa

    1999-03-01

    Japan Atomic Energy Research Institute has been developing 'KMtool', a parallel/serial program analyzing tool, in order to promote the parallelization of the science and engineering computation program. KMtool analyzes the performance of program written by FORTRAN77 and MPI, and it reduces the effort for parallelization. This paper describes development purpose, design, utilization and evaluation of KMtool. (author)

  3. Parallel grid generation algorithm for distributed memory computers

    Science.gov (United States)

    Moitra, Stuti; Moitra, Anutosh

    1994-01-01

    A parallel grid-generation algorithm and its implementation on the Intel iPSC/860 computer are described. The grid-generation scheme is based on an algebraic formulation of homotopic relations. Methods for utilizing the inherent parallelism of the grid-generation scheme are described, and implementation of multiple levELs of parallelism on multiple instruction multiple data machines are indicated. The algorithm is capable of providing near orthogonality and spacing control at solid boundaries while requiring minimal interprocessor communications. Results obtained on the Intel hypercube for a blended wing-body configuration are used to demonstrate the effectiveness of the algorithm. Fortran implementations bAsed on the native programming model of the iPSC/860 computer and the Express system of software tools are reported. Computational gains in execution time speed-up ratios are given.

  4. A parallelization study of the general purpose Monte Carlo code MCNP4 on a distributed memory highly parallel computer

    International Nuclear Information System (INIS)

    Yamazaki, Takao; Fujisaki, Masahide; Okuda, Motoi; Takano, Makoto; Masukawa, Fumihiro; Naito, Yoshitaka

    1993-01-01

    The general purpose Monte Carlo code MCNP4 has been implemented on the Fujitsu AP1000 distributed memory highly parallel computer. Parallelization techniques developed and studied are reported. A shielding analysis function of the MCNP4 code is parallelized in this study. A technique to map a history to each processor dynamically and to map control process to a certain processor was applied. The efficiency of parallelized code is up to 80% for a typical practical problem with 512 processors. These results demonstrate the advantages of a highly parallel computer to the conventional computers in the field of shielding analysis by Monte Carlo method. (orig.)

  5. Massively Parallel Polar Decomposition on Distributed-Memory Systems

    KAUST Repository

    Ltaief, Hatem

    2018-01-01

    We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation for the scalar sign function, which also corresponds to the polar factor for symmetric matrices, to further accelerate the QDWH convergence. Based on the Zolotarev rational functions—introduced by Zolotarev (ZOLO) in 1877— this new PD algorithm ZOLO-PD converges within two iterations even for ill-conditioned matrices, instead of the original six iterations needed for QDWH. ZOLO-PD uses the property of Zolotarev functions that optimality is maintained when two functions are composed in an appropriate manner. The resulting ZOLO-PD has a convergence rate up to seventeen, in contrast to the cubic convergence rate for QDWH. This comes at the price of higher arithmetic costs and memory footprint. These extra floating-point operations can, however, be processed in an embarrassingly parallel fashion. We demonstrate performance using up to 102, 400 cores on two supercomputers. We demonstrate that, in the presence of a large number of processing units, ZOLO-PD is able to outperform QDWH by up to 2.3X speedup, especially in situations where QDWH runs out of work, for instance, in the strong scaling mode of operation.

  6. Step by step parallel programming method for molecular dynamics code

    International Nuclear Information System (INIS)

    Orii, Shigeo; Ohta, Toshio

    1996-07-01

    Parallel programming for a numerical simulation program of molecular dynamics is carried out with a step-by-step programming technique using the two phase method. As a result, within the range of a certain computing parameters, it is found to obtain parallel performance by using the level of parallel programming which decomposes the calculation according to indices of do-loops into each processor on the vector parallel computer VPP500 and the scalar parallel computer Paragon. It is also found that VPP500 shows parallel performance in wider range computing parameters. The reason is that the time cost of the program parts, which can not be reduced by the do-loop level of the parallel programming, can be reduced to the negligible level by the vectorization. After that, the time consuming parts of the program are concentrated on less parts that can be accelerated by the do-loop level of the parallel programming. This report shows the step-by-step parallel programming method and the parallel performance of the molecular dynamics code on VPP500 and Paragon. (author)

  7. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    Science.gov (United States)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  8. Concurrent computation of attribute filters on shared memory parallel machines

    NARCIS (Netherlands)

    Wilkinson, Michael H.F.; Gao, Hui; Hesselink, Wim H.; Jonker, Jan-Eppo; Meijster, Arnold

    2008-01-01

    Morphological attribute filters have not previously been parallelized mainly because they are both global and nonseparable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings, and thickenings,

  9. Three dimensional Burn-up program parallelization using socket programming

    International Nuclear Information System (INIS)

    Haliyati R, Evi; Su'ud, Zaki

    2002-01-01

    A computer parallelization process was built with a purpose to decrease execution time of a physics program. In this case, a multi computer system was built to be used to analyze burn-up process of a nuclear reactor. This multi computer system was design need using a protocol communication among sockets, i.e. TCP/IP. This system consists of computer as a server and the rest as clients. The server has a main control to all its clients. The server also divides the reactor core geometrically to in parts in accordance with the number of clients, each computer including the server has a task to conduct burn-up analysis of 1/n part of the total reactor core measure. This burn-up analysis was conducted simultaneously and in a parallel way by all computers, so a faster program execution time was achieved close to 1/n times that of one computer. Then an analysis was carried out and states that in order to calculate the density of atoms in a reactor of 91 cm x 91 cm x 116 cm, the usage of a parallel system of 2 computers has the highest efficiency

  10. Assessing Programming Costs of Explicit Memory Localization on a Large Scale Shared Memory Multiprocessor

    Directory of Open Access Journals (Sweden)

    Silvio Picano

    1992-01-01

    Full Text Available We present detailed experimental work involving a commercially available large scale shared memory multiple instruction stream-multiple data stream (MIMD parallel computer having a software controlled cache coherence mechanism. To make effective use of such an architecture, the programmer is responsible for designing the program's structure to match the underlying multiprocessors capabilities. We describe the techniques used to exploit our multiprocessor (the BBN TC2000 on a network simulation program, showing the resulting performance gains and the associated programming costs. We show that an efficient implementation relies heavily on the user's ability to explicitly manage the memory system.

  11. Parallel constraint satisfaction in memory-based decisions.

    Science.gov (United States)

    Glöckner, Andreas; Hodges, Sara D

    2011-01-01

    Three studies sought to investigate decision strategies in memory-based decisions and to test the predictions of the parallel constraint satisfaction (PCS) model for decision making (Glöckner & Betsch, 2008). Time pressure was manipulated and the model was compared against simple heuristics (take the best and equal weight) and a weighted additive strategy. From PCS we predicted that fast intuitive decision making is based on compensatory information integration and that decision time increases and confidence decreases with increasing inconsistency in the decision task. In line with these predictions we observed a predominant usage of compensatory strategies under all time-pressure conditions and even with decision times as short as 1.7 s. For a substantial number of participants, choices and decision times were best explained by PCS, but there was also evidence for use of simple heuristics. The time-pressure manipulation did not significantly affect decision strategies. Overall, the results highlight intuitive, automatic processes in decision making and support the idea that human information-processing capabilities are less severely bounded than often assumed.

  12. Battling memory requirements of array programming through streaming

    DEFF Research Database (Denmark)

    Kristensen, Mads Ruben Burgdorff; Avery, James Emil; Blum, Troels

    2016-01-01

    A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations completely without loops, while most efficient on small input, can lead to explosions in memory use. The present paper presents a solution to this problem using array streaming......, implemented in the automatic parallelization high-performance framework Bohrium. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine capacity, since the automatic streaming eliminates the temporary memory overhead...... by performing calculations in per-thread registers. Using Bohrium, we automatically fuse, JIT-compile, and execute NumPy array operations on GPGPUs without modification to the user programs. We present performance evaluations of three benchmarks, all of which show dramatic reductions in memory use from...

  13. Distributed Memory Programming on Many-Cores

    DEFF Research Database (Denmark)

    Berthold, Jost; Dieterle, Mischa; Lobachev, Oleg

    2009-01-01

    Eden is a parallel extension of the lazy functional language Haskell providing dynamic process creation and automatic data exchange. As a Haskell extension, Eden takes a high-level approach to parallel programming and thereby simplifies parallel program development. The current implementation is ...

  14. Parallel programming practical aspects, models and current limitations

    CERN Document Server

    Tarkov, Mikhail S

    2014-01-01

    Parallel programming is designed for the use of parallel computer systems for solving time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: 1. Processing large data arrays (including processing images and signals in real time)2. Simulation of complex physical processes and chemical reactions For each of these classes, prospective methods are designed for solving problems. For data processing, one of the most promising technologies is the use of artificial neural networks. Particles-in-cell method and cellular automata are very useful for simulation. Problems of scalability of parallel algorithms and the transfer of existing parallel programs to future parallel computers are very acute now. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure the processing reliability by the relevant organization ...

  15. On the Automatic Parallelization of Sparse and Irregular Fortran Programs

    Directory of Open Access Journals (Sweden)

    Yuan Lin

    1999-01-01

    Full Text Available Automatic parallelization is usually believed to be less effective at exploiting implicit parallelism in sparse/irregular programs than in their dense/regular counterparts. However, not much is really known because there have been few research reports on this topic. In this work, we have studied the possibility of using an automatic parallelizing compiler to detect the parallelism in sparse/irregular programs. The study with a collection of sparse/irregular programs led us to some common loop patterns. Based on these patterns new techniques were derived that produced good speedups when manually applied to our benchmark codes. More importantly, these parallelization methods can be implemented in a parallelizing compiler and can be applied automatically.

  16. Survey on present status and trend of parallel programming environments

    International Nuclear Information System (INIS)

    Takemiya, Hiroshi; Higuchi, Kenji; Honma, Ichiro; Ohta, Hirofumi; Kawasaki, Takuji; Imamura, Toshiyuki; Koide, Hiroshi; Akimoto, Masayuki.

    1997-03-01

    This report intends to provide useful information on software tools for parallel programming through the survey on parallel programming environments of the following six parallel computers, Fujitsu VPP300/500, NEC SX-4, Hitachi SR2201, Cray T94, IBM SP, and Intel Paragon, all of which are installed at Japan Atomic Energy Research Institute (JAERI), moreover, the present status of R and D's on parallel softwares of parallel languages, compilers, debuggers, performance evaluation tools, and integrated tools is reported. This survey has been made as a part of our project of developing a basic software for parallel programming environment, which is designed on the concept of STA (Seamless Thinking Aid to programmers). (author)

  17. The BLAZE language - A parallel language for scientific programming

    Science.gov (United States)

    Mehrotra, Piyush; Van Rosendale, John

    1987-01-01

    A Pascal-like scientific programming language, BLAZE, is described. BLAZE contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus BLAZE should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with conceptually sequential control flow. A central goal in the design of BLAZE is portability across a broad range of parallel architectures. The multiple levels of parallelism present in BLAZE code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of BLAZE are described and it is shown how this language would be used in typical scientific programming.

  18. The BLAZE language: A parallel language for scientific programming

    Science.gov (United States)

    Mehrotra, P.; Vanrosendale, J.

    1985-01-01

    A Pascal-like scientific programming language, Blaze, is described. Blaze contains array arithmetic, forall loops, and APL-style accumulation operators, which allow natural expression of fine grained parallelism. It also employs an applicative or functional procedure invocation mechanism, which makes it easy for compilers to extract coarse grained parallelism using machine specific program restructuring. Thus Blaze should allow one to achieve highly parallel execution on multiprocessor architectures, while still providing the user with onceptually sequential control flow. A central goal in the design of Blaze is portability across a broad range of parallel architectures. The multiple levels of parallelism present in Blaze code, in principle, allow a compiler to extract the types of parallelism appropriate for the given architecture while neglecting the remainder. The features of Blaze are described and shows how this language would be used in typical scientific programming.

  19. Professional Parallel Programming with C# Master Parallel Extensions with NET 4

    CERN Document Server

    Hillar, Gastón

    2010-01-01

    Expert guidance for those programming today's dual-core processors PCs As PC processors explode from one or two to now eight processors, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization.Teach

  20. A Model for Speedup of Parallel Programs

    Science.gov (United States)

    1997-01-01

    Sanjeev. K Setia . The interaction between mem- ory allocation and adaptive partitioning in message- passing multicomputers. In IPPS 󈨣 Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [15] Sanjeev K. Setia and Satish K. Tripathi. A compar- ative analysis of static

  1. The parallel processing system for fast 3D-CT image reconstruction by circular shifting float memory architecture

    International Nuclear Information System (INIS)

    Wang Shi; Kang Kejun; Wang Jingjin

    1996-01-01

    Computerized Tomography (CT) is expected to become an inevitable diagnostic technique in the future. However, the long time required to reconstruct an image has been one of the major drawbacks associated with this technique. Parallel process is one of the best way to solve this problem. This paper gives the architecture, hardware and software design of PIRS-4 (4-processor Parallel Image Reconstruction System), which is a parallel processing system for fast 3D-CT image reconstruction by circular shifting float memory architecture. It includes the structure and components of the system, the design of crossbar switch and details of control model, the description of RPBP image reconstruction, the choice of OS (Operate System) and language, the principle of imitating EMS, direct memory R/W of float and programming in the protect model. Finally, the test results are given

  2. Human Memory Organization for Computer Programs.

    Science.gov (United States)

    Norcio, A. F.; Kerst, Stephen M.

    1983-01-01

    Results of study investigating human memory organization in processing of computer programming languages indicate that algorithmic logic segments form a cognitive organizational structure in memory for programs. Statement indentation and internal program documentation did not enhance organizational process of recall of statements in five Fortran…

  3. An environment for parallel structuring of Fortran programs

    International Nuclear Information System (INIS)

    Sridharan, K.; McShea, M.; Denton, C.; Eventoff, B.; Browne, J.C.; Newton, P.; Ellis, M.; Grossbard, D.; Wise, T.; Clemmer, D.

    1990-01-01

    The paper describes and illustrates an environment for interactive support of the detection and implementation of macro-level parallelism in Fortran programs. The approach couples algorithms for dependence analysis with both innovative techniques for complexity management and capabilities for the measurement and analysis of the parallel computation structures generated through use of the environment. The resulting environment is complementary to the more common approach of seeking local parallelism by loop unrolling, either by an automatic compiler or manually. (orig.)

  4. Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems

    KAUST Repository

    Mudigere, Dheevatsa; Sridharan, Srinivas; Deshpande, Anand; Park, Jongsoo; Heinecke, Alexander; Smelyanskiy, Mikhail; Kaul, Bharat; Dubey, Pradeep; Kaushik, Dinesh; Keyes, David E.

    2015-01-01

    -grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain

  5. A general purpose subroutine for fast fourier transform on a distributed memory parallel machine

    Science.gov (United States)

    Dubey, A.; Zubair, M.; Grosch, C. E.

    1992-01-01

    One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.

  6. Computational cost of isogeometric multi-frontal solvers on parallel distributed memory machines

    KAUST Repository

    Woźniak, Maciej; Paszyński, Maciej R.; Pardo, D.; Dalcin, Lisandro; Calo, Victor M.

    2015-01-01

    This paper derives theoretical estimates of the computational cost for isogeometric multi-frontal direct solver executed on parallel distributed memory machines. We show theoretically that for the Cp-1 global continuity of the isogeometric solution

  7. Parallelization of MCNP 4, a Monte Carlo neutron and photon transport code system, in highly parallel distributed memory type computer

    International Nuclear Information System (INIS)

    Masukawa, Fumihiro; Takano, Makoto; Naito, Yoshitaka; Yamazaki, Takao; Fujisaki, Masahide; Suzuki, Koichiro; Okuda, Motoi.

    1993-11-01

    In order to improve the accuracy and calculating speed of shielding analyses, MCNP 4, a Monte Carlo neutron and photon transport code system, has been parallelized and measured of its efficiency in the highly parallel distributed memory type computer, AP1000. The code has been analyzed statically and dynamically, then the suitable algorithm for parallelization has been determined for the shielding analysis functions of MCNP 4. This includes a strategy where a new history is assigned to the idling processor element dynamically during the execution. Furthermore, to avoid the congestion of communicative processing, the batch concept, processing multi-histories by a unit, has been introduced. By analyzing a sample cask problem with 2,000,000 histories by the AP1000 with 512 processor elements, the 82 % of parallelization efficiency is achieved, and the calculational speed has been estimated to be around 50 times as fast as that of FACOM M-780. (author)

  8. Integrated Task And Data Parallel Programming: Language Design

    Science.gov (United States)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    his research investigates the combination of task and data parallel language constructs within a single programming language. There are an number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program m. Additional 1995 Activities During the fall I collaborated

  9. Dataflow Query Execution in a Parallel, Main-memory Environment

    NARCIS (Netherlands)

    Wilschut, A.N.; Apers, Peter M.G.

    In this paper, the performance and characteristics of the execution of various join-trees on a parallel DBMS are studied. The results of this study are a step into the direction of the design of a query optimization strategy that is fit for parallel execution of complex queries. Among others,

  10. Dataflow Query Execution in a Parallel Main-Memory Environment

    NARCIS (Netherlands)

    Wilschut, A.N.; Apers, Peter M.G.

    1991-01-01

    The performance and characteristics of the execution of various join-trees on a parallel DBMS are studied. The results are a step in the direction of the design of a query optimization strategy that is fit for parallel execution of complex queries. Among others, synchronization issues are identified

  11. Memory Retrieval Given Two Independent Cues: Cue Selection or Parallel Access?

    Science.gov (United States)

    Rickard, Timothy C.; Bajic, Daniel

    2004-01-01

    A basic but unresolved issue in the study of memory retrieval is whether multiple independent cues can be used concurrently (i.e., in parallel) to recall a single, common response. A number of empirical results, as well as potentially applicable theories, suggest that retrieval can proceed in parallel, though Rickard (1997) set forth a model that…

  12. An Alternative Algorithm for Computing Watersheds on Shared Memory Parallel Computers

    NARCIS (Netherlands)

    Meijster, A.; Roerdink, J.B.T.M.

    1995-01-01

    In this paper a parallel implementation of a watershed algorithm is proposed. The algorithm can easily be implemented on shared memory parallel computers. The watershed transform is generally considered to be inherently sequential since the discrete watershed of an image is defined using recursion.

  13. A Programming Environment for Parallel Vision Algorithms

    Science.gov (United States)

    1990-04-11

    industrial arm on the market , while the unique head was designed by Rochester’s Computer Science and Mechanical Engineering Departments. 9a 4.1 Introduction...R. Constraining-Unification and the Programming Language Unicorn . In Logic Programming, Functions, Relations, and Equations, Degroot and Lind- strom

  14. Declarative Parallel Programming in Spreadsheet End-User Development

    DEFF Research Database (Denmark)

    Biermann, Florian

    2016-01-01

    Spreadsheets are first-order functional languages and are widely used in research and industry as a tool to conveniently perform all kinds of computations. Because cells on a spreadsheet are immutable, there are possibilities for implicit parallelization of spreadsheet computations. In this liter...... can directly apply results from functional array programming to a spreadsheet model of computations.......Spreadsheets are first-order functional languages and are widely used in research and industry as a tool to conveniently perform all kinds of computations. Because cells on a spreadsheet are immutable, there are possibilities for implicit parallelization of spreadsheet computations....... In this literature study, we provide an overview of the publications on spreadsheet end-user programming and declarative array programming to inform further research on parallel programming in spreadsheets. Our results show that there is a clear overlap between spreadsheet programming and array programming and we...

  15. A homotopy method for solving Riccati equations on a shared memory parallel computer

    International Nuclear Information System (INIS)

    Zigic, D.; Watson, L.T.; Collins, E.G. Jr.; Davis, L.D.

    1993-01-01

    Although there are numerous algorithms for solving Riccati equations, there still remains a need for algorithms which can operate efficiently on large problems and on parallel machines. This paper gives a new homotopy-based algorithm for solving Riccati equations on a shared memory parallel computer. The central part of the algorithm is the computation of the kernel of the Jacobian matrix, which is essential for the corrector iterations along the homotopy zero curve. Using a Schur decomposition the tensor product structure of various matrices can be efficiently exploited. The algorithm allows for efficient parallelization on shared memory machines

  16. Massively Parallel Polar Decomposition on Distributed-Memory Systems

    KAUST Repository

    Ltaief, Hatem; Sukkari, Dalal E.; Esposito, Aniello; Nakatsukasa, Yuji; Keyes, David E.

    2018-01-01

    We present a high-performance implementation of the Polar Decomposition (PD) on distributed-memory systems. Building upon on the QR-based Dynamically Weighted Halley (QDWH) algorithm, the key idea lies in finding the best rational approximation

  17. Memory, reasoning, and categorization: parallels and common mechanisms

    OpenAIRE

    Hayes, Brett K.; Heit, Evan; Rotello, Caren M.

    2014-01-01

    Traditionally, memory, reasoning and categorization have been treated as separate components of human cognition. We challenge this distinction, arguing that there is broad scope for crossover between the methods and theories developed for each task. The links between memory and reasoning are illustrated in a review of two lines of research. The first takes theoretical ideas (two-process accounts) and methodological tools (signal detection analysis, receiver operating characteristic curves)...

  18. Program Transformation to Identify List-Based Parallel Skeletons

    Directory of Open Access Journals (Sweden)

    Venkatesh Kannan

    2016-07-01

    Full Text Available Algorithmic skeletons are used as building-blocks to ease the task of parallel programming by abstracting the details of parallel implementation from the developer. Most existing libraries provide implementations of skeletons that are defined over flat data types such as lists or arrays. However, skeleton-based parallel programming is still very challenging as it requires intricate analysis of the underlying algorithm and often uses inefficient intermediate data structures. Further, the algorithmic structure of a given program may not match those of list-based skeletons. In this paper, we present a method to automatically transform any given program to one that is defined over a list and is more likely to contain instances of list-based skeletons. This facilitates the parallel execution of a transformed program using existing implementations of list-based parallel skeletons. Further, by using an existing transformation called distillation in conjunction with our method, we produce transformed programs that contain fewer inefficient intermediate data structures.

  19. Wnt signaling inhibits CTL memory programming.

    Science.gov (United States)

    Xiao, Zhengguo; Sun, Zhifeng; Smyth, Kendra; Li, Lei

    2013-12-01

    Induction of functional CTLs is one of the major goals for vaccine development and cancer therapy. Inflammatory cytokines are critical for memory CTL generation. Wnt signaling is important for CTL priming and memory formation, but its role in cytokine-driven memory CTL programming is unclear. We found that wnt signaling inhibited IL-12-driven CTL activation and memory programming. This impaired memory CTL programming was attributed to up-regulation of eomes and down-regulation of T-bet. Wnt signaling suppressed the mTOR pathway during CTL activation, which was different to its effects on other cell types. Interestingly, the impaired memory CTL programming by wnt was partially rescued by mTOR inhibitor rapamycin. In conclusion, we found that crosstalk between wnt and the IL-12 signaling inhibits T-bet and mTOR pathways and impairs memory programming which can be recovered in part by rapamycin. In addition, direct inhibition of wnt signaling during CTL activation does not affect CTL memory programming. Therefore, wnt signaling may serve as a new tool for CTL manipulation in autoimmune diseases and immune therapy for certain cancers. Copyright © 2013 Elsevier Ltd. All rights reserved.

  20. Improved Parallel Three-List Algorithm for the Knapsack Problem without Memory Conflicts

    Institute of Scientific and Technical Information of China (English)

    Pan Jun; Li Kenli; Li Qinghua

    2006-01-01

    Based on the two-list algorithm and the parallel three-list algorithm, an improved parallel three-list algorithm for knapsack problem is proposed, in which the method of divide and conquer, and parallel merging without memory conflicts are adopted. To find a solution for the n-element knapsack problem, the proposed algorithm needs O(23n/8) time when O(23n/8) shared memory units and O(2n/4) processors are available. The comparisons between the proposed algorithm and 10 existing algorithms show that the improved parallel three-list algorithm is the first exclusive-read exclusive-write (EREW) parallel algorithm that can solve the knapsack instances in less than O(2n/2) time when the available hardware resource is smaller than O(2n/2), and hence is an improved result over the past researches.

  1. Memory, reasoning, and categorization: parallels and common mechanisms.

    Science.gov (United States)

    Hayes, Brett K; Heit, Evan; Rotello, Caren M

    2014-01-01

    Traditionally, memory, reasoning, and categorization have been treated as separate components of human cognition. We challenge this distinction, arguing that there is broad scope for crossover between the methods and theories developed for each task. The links between memory and reasoning are illustrated in a review of two lines of research. The first takes theoretical ideas (two-process accounts) and methodological tools (signal detection analysis, receiver operating characteristic curves) from memory research and applies them to important issues in reasoning research: relations between induction and deduction, and the belief bias effect. The second line of research introduces a task in which subjects make either memory or reasoning judgments for the same set of stimuli. Other than broader generalization for reasoning than memory, the results were similar for the two tasks, across a variety of experimental stimuli and manipulations. It was possible to simultaneously explain performance on both tasks within a single cognitive architecture, based on exemplar-based comparisons of similarity. The final sections explore evidence for empirical and processing links between inductive reasoning and categorization and between categorization and recognition. An important implication is that progress in all three of these fields will be expedited by further investigation of the many commonalities between these tasks.

  2. Memory, reasoning and categorization: Parallels and common mechanisms

    Directory of Open Access Journals (Sweden)

    BRETT eHAYES

    2014-06-01

    Full Text Available Traditionally, memory, reasoning and categorization have been treated as separate components of human cognition. We challenge this distinction, arguing that there is broad scope for crossover between the methods and theories developed for each task. The links between memory and reasoning are illustrated in a review of two lines of research. The first takes theoretical ideas (two-process accounts and methodological tools (signal detection analysis, receiver operating characteristic curves from memory research and applies them to important issues in reasoning research: relations between induction and deduction, and the belief bias effect. The second line of research introduces a task in which subjects make either memory or reasoning judgments for the same set of stimuli. Other than broader generalization for reasoning than memory, the results were similar for the two tasks, across a variety of experimental stimuli and manipulations. It was possible to simultaneously explain performance on both tasks within a single cognitive architecture, based on exemplar-based comparisons of similarity. The final sections explore evidence for empirical and processing links between inductive reasoning and categorization and between categorization and recognition. An important implication is that progress in all three of these fields will be expedited by further investigation of the many commonalities between these tasks.

  3. Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

    Science.gov (United States)

    Quealy, Angela; Cole, Gary L.; Blech, Richard A.

    1993-01-01

    The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.

  4. Development of massively parallel quantum chemistry program SMASH

    Energy Technology Data Exchange (ETDEWEB)

    Ishimura, Kazuya [Department of Theoretical and Computational Molecular Science, Institute for Molecular Science 38 Nishigo-Naka, Myodaiji, Okazaki, Aichi 444-8585 (Japan)

    2015-12-31

    A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C{sub 150}H{sub 30}){sub 2} with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.

  5. Development of massively parallel quantum chemistry program SMASH

    International Nuclear Information System (INIS)

    Ishimura, Kazuya

    2015-01-01

    A massively parallel program for quantum chemistry calculations SMASH was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to make program developments simple. The speed-up of the B3LYP energy calculation for (C 150 H 30 ) 2 with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer

  6. On program restructuring, scheduling, and communication for parallel processor systems

    Energy Technology Data Exchange (ETDEWEB)

    Polychronopoulos, Constantine D. [Univ. of Illinois, Urbana, IL (United States)

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented.

  7. Parallel programming of saccades during natural scene viewing: evidence from eye movement positions.

    Science.gov (United States)

    Wu, Esther X W; Gilani, Syed Omer; van Boxtel, Jeroen J A; Amihai, Ido; Chua, Fook Kee; Yen, Shih-Cheng

    2013-10-24

    Previous studies have shown that saccade plans during natural scene viewing can be programmed in parallel. This evidence comes mainly from temporal indicators, i.e., fixation durations and latencies. In the current study, we asked whether eye movement positions recorded during scene viewing also reflect parallel programming of saccades. As participants viewed scenes in preparation for a memory task, their inspection of the scene was suddenly disrupted by a transition to another scene. We examined whether saccades after the transition were invariably directed immediately toward the center or were contingent on saccade onset times relative to the transition. The results, which showed a dissociation in eye movement behavior between two groups of saccades after the scene transition, supported the parallel programming account. Saccades with relatively long onset times (>100 ms) after the transition were directed immediately toward the center of the scene, probably to restart scene exploration. Saccades with short onset times (programming of saccades during scene viewing. Additionally, results from the analyses of intersaccadic intervals were also consistent with the parallel programming hypothesis.

  8. Basic design of parallel computational program for probabilistic structural analysis

    International Nuclear Information System (INIS)

    Kaji, Yoshiyuki; Arai, Taketoshi; Gu, Wenwei; Nakamura, Hitoshi

    1999-06-01

    In our laboratory, for 'development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory' (nuclear computational science cross-over research) we examine computational method related to super parallel computation system which is coupled with material strength theory based on microscopic fracture mechanics for latent cracks and continuum structural model to develop new structural reliability evaluation methods for ceramic structures. This technical report is the review results regarding probabilistic structural mechanics theory, basic terms of formula and program methods of parallel computation which are related to principal terms in basic design of computational mechanics program. (author)

  9. Basic design of parallel computational program for probabilistic structural analysis

    Energy Technology Data Exchange (ETDEWEB)

    Kaji, Yoshiyuki; Arai, Taketoshi [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Gu, Wenwei; Nakamura, Hitoshi

    1999-06-01

    In our laboratory, for `development of damage evaluation method of structural brittle materials by microscopic fracture mechanics and probabilistic theory` (nuclear computational science cross-over research) we examine computational method related to super parallel computation system which is coupled with material strength theory based on microscopic fracture mechanics for latent cracks and continuum structural model to develop new structural reliability evaluation methods for ceramic structures. This technical report is the review results regarding probabilistic structural mechanics theory, basic terms of formula and program methods of parallel computation which are related to principal terms in basic design of computational mechanics program. (author)

  10. Parallel-vector algorithms for particle simulations on shared-memory multiprocessors

    International Nuclear Information System (INIS)

    Nishiura, Daisuke; Sakaguchi, Hide

    2011-01-01

    Over the last few decades, the computational demands of massive particle-based simulations for both scientific and industrial purposes have been continuously increasing. Hence, considerable efforts are being made to develop parallel computing techniques on various platforms. In such simulations, particles freely move within a given space, and so on a distributed-memory system, load balancing, i.e., assigning an equal number of particles to each processor, is not guaranteed. However, shared-memory systems achieve better load balancing for particle models, but suffer from the intrinsic drawback of memory access competition, particularly during (1) paring of contact candidates from among neighboring particles and (2) force summation for each particle. Here, novel algorithms are proposed to overcome these two problems. For the first problem, the key is a pre-conditioning process during which particle labels are sorted by a cell label in the domain to which the particles belong. Then, a list of contact candidates is constructed by pairing the sorted particle labels. For the latter problem, a table comprising the list indexes of the contact candidate pairs is created and used to sum the contact forces acting on each particle for all contacts according to Newton's third law. With just these methods, memory access competition is avoided without additional redundant procedures. The parallel efficiency and compatibility of these two algorithms were evaluated in discrete element method (DEM) simulations on four types of shared-memory parallel computers: a multicore multiprocessor computer, scalar supercomputer, vector supercomputer, and graphics processing unit. The computational efficiency of a DEM code was found to be drastically improved with our algorithms on all but the scalar supercomputer. Thus, the developed parallel algorithms are useful on shared-memory parallel computers with sufficient memory bandwidth.

  11. A language for data-parallel and task parallel programming dedicated to multi-SIMD computers. Contributions to hydrodynamic simulation with lattice gases

    International Nuclear Information System (INIS)

    Pic, Marc Michel

    1995-01-01

    Parallel programming covers task-parallelism and data-parallelism. Many problems need both parallelisms. Multi-SIMD computers allow hierarchical approach of these parallelisms. The T++ language, based on C++, is dedicated to exploit Multi-SIMD computers using a programming paradigm which is an extension of array-programming to tasks managing. Our language introduced array of independent tasks to achieve separately (MIMD), on subsets of processors of identical behaviour (SIMD), in order to translate the hierarchical inclusion of data-parallelism in task-parallelism. To manipulate in a symmetrical way tasks and data we propose meta-operations which have the same behaviour on tasks arrays and on data arrays. We explain how to implement this language on our parallel computer SYMPHONIE in order to profit by the locally-shared memory, by the hardware virtualization, and by the multiplicity of communications networks. We analyse simultaneously a typical application of such architecture. Finite elements scheme for Fluid mechanic needs powerful parallel computers and requires large floating points abilities. Lattice gases is an alternative to such simulations. Boolean lattice bases are simple, stable, modular, need to floating point computation, but include numerical noise. Boltzmann lattice gases present large precision of computation, but needs floating points and are only locally stable. We propose a new scheme, called multi-bit, who keeps the advantages of each boolean model to which it is applied, with large numerical precision and reduced noise. Experiments on viscosity, physical behaviour, noise reduction and spurious invariants are shown and implementation techniques for parallel Multi-SIMD computers detailed. (author) [fr

  12. Method for programming a flash memory

    Energy Technology Data Exchange (ETDEWEB)

    Brosky, Alexander R.; Locke, William N.; Maher, Conrado M.

    2016-08-23

    A method of programming a flash memory is described. The method includes partitioning a flash memory into a first group having a first level of write-protection, a second group having a second level of write-protection, and a third group having a third level of write-protection. The write-protection of the second and third groups is disabled using an installation adapter. The third group is programmed using a Software Installation Device.

  13. Fencing network direct memory access data transfers in a parallel active messaging interface of a parallel computer

    Science.gov (United States)

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-07-07

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  14. Memory effect on energy losses of charged particles moving parallel to solid surface

    International Nuclear Information System (INIS)

    Kwei, C.M.; Tu, Y.H.; Hsu, Y.H.; Tung, C.J.

    2006-01-01

    Theoretical derivations were made for the induced potential and the stopping power of a charged particle moving close and parallel to the surface of a solid. It was illustrated that the induced potential produced by the interaction of particle and solid depended not only on the velocity but also on the previous velocity of the particle before its last inelastic interaction. Another words, the particle kept a memory on its previous velocity, v , in determining the stopping power for the particle of velocity v. Based on the dielectric response theory, formulas were derived for the induced potential and the stopping power with memory effect. An extended Drude dielectric function with spatial dispersion was used in the application of these formulas for a proton moving parallel to Si surface. It was found that the induced potential with memory effect lay between induced potentials without memory effect for constant velocities v and v. The memory effect was manifest as the proton changes its velocity in the previous inelastic interaction. This memory effect also reduced the stopping power of the proton. The formulas derived in the present work can be applied to any solid surface and charged particle moving with arbitrary parallel trajectory either inside or outside the solid

  15. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Full Text Available Hybrid parallel multicore architectures based on graphics processing units (GPUs can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware display a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in a unintrusive manner while preserving the legacy code.

  16. An approach to multicore parallelism using functional programming: A case study based on Presburger Arithmetic

    DEFF Research Database (Denmark)

    Dung, Phan Anh; Hansen, Michael Reichhardt

    2015-01-01

    In this paper we investigate multicore parallelism in the context of functional programming by means of two quantifier-elimination procedures for Presburger Arithmetic: one is based on Cooper’s algorithm and the other is based on the Omega Test. We first develop correct-by-construction prototype...... platform executing on an 8-core machine. A speedup of approximately 4 was obtained for Cooper’s algorithm and a speedup of approximately 6 was obtained for the exact-shadow part of the Omega Test. The considered procedures are complex, memory-intense algorithms on huge formula trees and the case study...... reveals more general applicable techniques and guideline for deriving parallel algorithms from sequential ones in the context of data-intensive tree algorithms. The obtained insights should apply for any strict and impure functional programming language. Furthermore, the results obtained for the exact...

  17. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    Science.gov (United States)

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E.coli, Shigella and S.pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.

  18. A new shared-memory programming paradigm for molecular dynamics simulations on the Intel Paragon

    International Nuclear Information System (INIS)

    D'Azevedo, E.F.; Romine, C.H.

    1994-12-01

    This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing

  19. High Performance Programming Using Explicit Shared Memory Model on Cray T3D1

    Science.gov (United States)

    Simon, Horst D.; Saini, Subhash; Grassi, Charles

    1994-01-01

    The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under Cray Research adaptive Fortran (CRAFT) model four programming methods (data parallel, work sharing, message-passing using PVM, and explicit shared memory model) are available to the users. However, at this time data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that the performance of neither standard PVM nor CRI s PVM exploits the hardware capabilities of the T3D. The reasons for the bad performance of PVM as a native message-passing library are presented. This is illustrated by the performance of NAS Parallel Benchmarks (NPB) programmed in explicit shared memory model on Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than obtained by using explicit shared memory model. This degradation in performance is also seen on CM-5 where the performance of applications using native message-passing library CMMD on CM-5 is also about 4 to 5 times less than using data parallel methods. The issues involved (such as barriers, synchronization, invalidating data cache, aligning data cache etc.) while programming in explicit shared memory model are discussed. Comparative performance of NPB using explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM-SP1, etc. is presented.

  20. Scientific programming on massively parallel processor CP-PACS

    International Nuclear Information System (INIS)

    Boku, Taisuke

    1998-01-01

    The massively parallel processor CP-PACS takes various problems of calculation physics as the object, and it has been designed so that its architecture has been devised to do various numerical processings. In this report, the outline of the CP-PACS and the example of programming in the Kernel CG benchmark in NAS Parallel Benchmarks, version 1, are shown, and the pseudo vector processing mechanism and the parallel processing tuning of scientific and technical computation utilizing the three-dimensional hyper crossbar net, which are two great features of the architecture of the CP-PACS are described. As for the CP-PACS, the PUs based on RISC processor and added with pseudo vector processor are used. Pseudo vector processing is realized as the loop processing by scalar command. The features of the connection net of PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part that takes the time for processing most in the main loop is the product of matrix and vector (matvec), and the parallel processing of the matvec is explained. The time for the computation by the CPU is determined. As the evaluation of the performance, the evaluation of the time for execution, the short vector processing of pseudo vector processor based on slide window, and the comparison with other parallel computers are reported. (K.I.)

  1. Evidence for parallel consolidation of motion direction and orientation into visual short-term memory.

    Science.gov (United States)

    Rideaux, Reuben; Apthorp, Deborah; Edwards, Mark

    2015-02-12

    Recent findings have indicated the capacity to consolidate multiple items into visual short-term memory in parallel varies as a function of the type of information. That is, while color can be consolidated in parallel, evidence suggests that orientation cannot. Here we investigated the capacity to consolidate multiple motion directions in parallel and reexamined this capacity using orientation. This was achieved by determining the shortest exposure duration necessary to consolidate a single item, then examining whether two items, presented simultaneously, could be consolidated in that time. The results show that parallel consolidation of direction and orientation information is possible, and that parallel consolidation of direction appears to be limited to two. Additionally, we demonstrate the importance of adequate separation between feature intervals used to define items when attempting to consolidate in parallel, suggesting that when multiple items are consolidated in parallel, as opposed to serially, the resolution of representations suffer. Finally, we used facilitation of spatial attention to show that the deterioration of item resolution occurs during parallel consolidation, as opposed to storage. © 2015 ARVO.

  2. Final Report: Center for Programming Models for Scalable Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    Mellor-Crummey, John [William Marsh Rice University

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  3. A modified parallel paradigm for clinical evaluation of auditory echoic memory.

    Science.gov (United States)

    Karino, Shotaro; Yumoto, Masato; Itoh, Kenji; Yamakawa, Keiko; Mizuochi, Tomomi; Kaga, Kimitaka

    2005-05-12

    We established a new parallel paradigm for mismatch negativity by presenting repetitive trains of three consonant-vowel syllables and those of three sinusoidal tones alternately. Magnetoencephalography was performed to test the new method, and mismatch negativities in six study participants with normal hearing were compared with the results of the conventional oddball paradigm. Peak amplitude and latencies of mismatch negativity showed no significant difference between the methods. The maximum amplitude in short memory probe interval of 1.0 s was significantly larger than in long memory probe interval of 3.0 s, demonstrating decay in auditory echoic memory caused by a prolonged memory probe interval. The new method facilitated simultaneous evaluation of mismatch negativity with various stimuli in a shorter period.

  4. Computational cost estimates for parallel shared memory isogeometric multi-frontal solvers

    KAUST Repository

    Woźniak, Maciej

    2014-06-01

    In this paper we present computational cost estimates for parallel shared memory isogeometric multi-frontal solvers. The estimates show that the ideal isogeometric shared memory parallel direct solver scales as O( p2log(N/p)) for one dimensional problems, O(Np2) for two dimensional problems, and O(N4/3p2) for three dimensional problems, where N is the number of degrees of freedom, and p is the polynomial order of approximation. The computational costs of the shared memory parallel isogeometric direct solver are compared with those corresponding to the sequential isogeometric direct solver, being the latest equal to O(N p2) for the one dimensional case, O(N1.5p3) for the two dimensional case, and O(N2p3) for the three dimensional case. The shared memory version significantly reduces both the scalability in terms of N and p. Theoretical estimates are compared with numerical experiments performed with linear, quadratic, cubic, quartic, and quintic B-splines, in one and two spatial dimensions. © 2014 Elsevier Ltd. All rights reserved.

  5. Computational cost estimates for parallel shared memory isogeometric multi-frontal solvers

    KAUST Repository

    Woźniak, Maciej; Kuźnik, Krzysztof M.; Paszyński, Maciej R.; Calo, Victor M.; Pardo, D.

    2014-01-01

    In this paper we present computational cost estimates for parallel shared memory isogeometric multi-frontal solvers. The estimates show that the ideal isogeometric shared memory parallel direct solver scales as O( p2log(N/p)) for one dimensional problems, O(Np2) for two dimensional problems, and O(N4/3p2) for three dimensional problems, where N is the number of degrees of freedom, and p is the polynomial order of approximation. The computational costs of the shared memory parallel isogeometric direct solver are compared with those corresponding to the sequential isogeometric direct solver, being the latest equal to O(N p2) for the one dimensional case, O(N1.5p3) for the two dimensional case, and O(N2p3) for the three dimensional case. The shared memory version significantly reduces both the scalability in terms of N and p. Theoretical estimates are compared with numerical experiments performed with linear, quadratic, cubic, quartic, and quintic B-splines, in one and two spatial dimensions. © 2014 Elsevier Ltd. All rights reserved.

  6. Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language

    KAUST Repository

    Klein, Tobias

    2016-04-19

    The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program\\'s function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.

  7. Parallel effects of memory set activation and searchon timing and working memory capacity

    Directory of Open Access Journals (Sweden)

    Richard eSchweickert

    2014-07-01

    Full Text Available Accurately estimating a time interval is required in everyday activities such as driving or cooking. Estimating time is relatively easy, provided a person attends to it. But a brief shift of attention to another task usually interferes with timing. Most processes carried out concurrently with timing interfere with it. Curiously, some do not. Literature on a few processes suggests a general proposition, the Timing and Complex-Span Hypothesis: A process interferes with concurrent timing if and only if process performance is related to complex span. Complex-span is the number of items correctly recalled in order, when each item presented for study is followed by a brief activity. Literature on task switching, visual search, memory search, word generation and mental time travel supports the hypothesis. Previous work found that another process, activation of a memory set in long term memory, is not related to complex-span. If the Timing and Complex-Span Hypothesis is true, activation should not interfere with concurrent timing in dual-task conditions. We tested such activation in single-task memory search task conditions and in dual-task conditions where memory search was executed with concurrent timing. In Experiment 1, activating a memory set increased reaction time, with no significant effect on time production. In Experiment 2, set size and memory set activation were manipulated. Activation and set size had a puzzling interaction for time productions, perhaps due to difficult conditions, leading us to use a related but easier task in Experiment 3. In Experiment 3 increasing set size lengthened time production, but memory activation had no significant effect. Results here and in previous literature on the whole support the Timing and Complex-Span Hypotheses. Results also support a sequential organization of activation and search of memory. This organization predicts activation and set size have additive effects on reaction time and multiplicative

  8. Parallel effects of memory set activation and search on timing and working memory capacity.

    Science.gov (United States)

    Schweickert, Richard; Fortin, Claudette; Xi, Zhuangzhuang; Viau-Quesnel, Charles

    2014-01-01

    Accurately estimating a time interval is required in everyday activities such as driving or cooking. Estimating time is relatively easy, provided a person attends to it. But a brief shift of attention to another task usually interferes with timing. Most processes carried out concurrently with timing interfere with it. Curiously, some do not. Literature on a few processes suggests a general proposition, the Timing and Complex-Span Hypothesis: A process interferes with concurrent timing if and only if process performance is related to complex span. Complex-span is the number of items correctly recalled in order, when each item presented for study is followed by a brief activity. Literature on task switching, visual search, memory search, word generation and mental time travel supports the hypothesis. Previous work found that another process, activation of a memory set in long term memory, is not related to complex-span. If the Timing and Complex-Span Hypothesis is true, activation should not interfere with concurrent timing in dual-task conditions. We tested such activation in single-task memory search task conditions and in dual-task conditions where memory search was executed with concurrent timing. In Experiment 1, activating a memory set increased reaction time, with no significant effect on time production. In Experiment 2, set size and memory set activation were manipulated. Activation and set size had a puzzling interaction for time productions, perhaps due to difficult conditions, leading us to use a related but easier task in Experiment 3. In Experiment 3 increasing set size lengthened time production, but memory activation had no significant effect. Results here and in previous literature on the whole support the Timing and Complex-Span Hypotheses. Results also support a sequential organization of activation and search of memory. This organization predicts activation and set size have additive effects on reaction time and multiplicative effects on percent

  9. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    OpenAIRE

    Albutiu, Martina-Cezara; Kemper, Alfons; Neumann, Thomas

    2012-01-01

    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel mult...

  10. Iterative schemes for parallel Sn algorithms in a shared-memory computing environment

    International Nuclear Information System (INIS)

    Haghighat, A.; Hunter, M.A.; Mattis, R.E.

    1995-01-01

    Several two-dimensional spatial domain partitioning S n transport theory algorithms are developed on the basis of different iterative schemes. These algorithms are incorporated into TWOTRAN-II and tested on the shared-memory CRAY Y-MP C90 computer. For a series of fixed-source r-z geometry homogeneous problems, it is demonstrated that the concurrent red-black algorithms may result in large parallel efficiencies (>60%) on C90. It is also demonstrated that for a realistic shielding problem, the use of the negative flux fixup causes high load imbalance, which results in a significant loss of parallel efficiency

  11. Feedback Driven Annotation and Refactoring of Parallel Programs

    DEFF Research Database (Denmark)

    Larsen, Per

    and communication in embedded programs. Runtime checks are developed to ensure that annotations correctly describe observable program behavior. The performance impact of runtime checking is evaluated on several benchmark kernels and is negligible in all cases. The second aspect is compilation feedback. Annotations...... are not effective unless programmers are told how and when they are benecial. A prototype compilation feedback system was developed in collaboration with IBM Haifa Research Labs. It reports issues that prevent further analysis to the programmer. Performance evaluation shows that three programs performes signicantly......This thesis combines programmer knowledge and feedback to improve modeling and optimization of software. The research is motivated by two observations. First, there is a great need for automatic analysis of software for embedded systems - to expose and model parallelism inherent in programs. Second...

  12. Programming Robots with Associative Memories

    International Nuclear Information System (INIS)

    Touzet, C.

    1999-01-01

    Today, there are several drawbacks that impede the necessary and much needed use of robot learning techniques in real applications. First, the time needed to achieve the synthesis of any behavior is prohibitive. Second, the robot behavior during the learning phase is by definition bad, it may even be dangerous. Third, except within the lazy learning approach, a new behavior implies a new learning phase. We propose in this paper to use self-organizing maps to encode the non explicit model of the robot-world interaction sampled by the lazy memory, and then generate a robot behavior by means of situations to be achieved, i.e., points on the self-organizing maps. Any behavior can instantaneously be synthesized by the definition of a goal situation. Its performance will be minimal (not evidently bad) and will improve by the mere repetition of the behavior

  13. Programming Robots with Associative Memories

    Energy Technology Data Exchange (ETDEWEB)

    Touzet, C

    1999-07-10

    Today, there are several drawbacks that impede the necessary and much needed use of robot learning techniques in real applications. First, the time needed to achieve the synthesis of any behavior is prohibitive. Second, the robot behavior during the learning phase is "by definition" bad, it may even be dangerous. Third, except within the lazy learning approach, a new behavior implies a new learning phase. We propose in this paper to use self-organizing maps to encode the non explicit model of the robot-world interaction sampled by the lazy memory, and then generate a robot behavior by means of situations to be achieved, i.e., points on the self-organizing maps. Any behavior can instantaneously be synthesized by the definition of a goal situation. Its performance will be minimal (not evidently bad) and will improve by the mere repetition of the behavior.

  14. A Screen Space GPGPU Surface LIC Algorithm for Distributed Memory Data Parallel Sort Last Rendering Infrastructures

    Energy Technology Data Exchange (ETDEWEB)

    Loring, Burlen; Karimabadi, Homa; Rortershteyn, Vadim

    2014-07-01

    The surface line integral convolution(LIC) visualization technique produces dense visualization of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and compatibility with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.

  15. Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language

    KAUST Repository

    Klein, Tobias; Bruckner, Stefan; Grö ller, M. Eduard; Hadwiger, Markus; Rautek, Peter

    2016-01-01

    The use of GPUs and the massively parallel computing paradigm have become wide-spread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program's function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.

  16. Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems

    OpenAIRE

    Martina-Cezara Albutiu, Alfons Kemper, Thomas Neumann

    2012-01-01

    Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research ...

  17. Computational cost of isogeometric multi-frontal solvers on parallel distributed memory machines

    KAUST Repository

    Woźniak, Maciej

    2015-02-01

    This paper derives theoretical estimates of the computational cost for isogeometric multi-frontal direct solver executed on parallel distributed memory machines. We show theoretically that for the Cp-1 global continuity of the isogeometric solution, both the computational cost and the communication cost of a direct solver are of order O(log(N)p2) for the one dimensional (1D) case, O(Np2) for the two dimensional (2D) case, and O(N4/3p2) for the three dimensional (3D) case, where N is the number of degrees of freedom and p is the polynomial order of the B-spline basis functions. The theoretical estimates are verified by numerical experiments performed with three parallel multi-frontal direct solvers: MUMPS, PaStiX and SuperLU, available through PETIGA toolkit built on top of PETSc. Numerical results confirm these theoretical estimates both in terms of p and N. For a given problem size, the strong efficiency rapidly decreases as the number of processors increases, becoming about 20% for 256 processors for a 3D example with 1283 unknowns and linear B-splines with C0 global continuity, and 15% for a 3D example with 643 unknowns and quartic B-splines with C3 global continuity. At the same time, one cannot arbitrarily increase the problem size, since the memory required by higher order continuity spaces is large, quickly consuming all the available memory resources even in the parallel distributed memory version. Numerical results also suggest that the use of distributed parallel machines is highly beneficial when solving higher order continuity spaces, although the number of processors that one can efficiently employ is somehow limited.

  18. Sensory-specific clock components and memory mechanisms: investigation with parallel timing.

    Science.gov (United States)

    Gamache, Pierre-Luc; Grondin, Simon

    2010-05-01

    A challenge for researchers in the time-perception field is to determine whether temporal processing is governed by a central mechanism or by multiple mechanisms working in concert. Behavioral studies of parallel timing offer interesting insights into the question, although the conclusions fail to converge. Most of these studies focus on the number-of-clocks issue, but the commonality of memory mechanisms involved in time processing is often neglected. The present experiment aims to address a straightforward question: do signals from different modalities marking time intervals share the same clock and/or the same memory resources? To this end, an interval reproduction task involving the parallel timing of two sensory signals presented either in the same modality or in different modalities was conducted. The memory component was tested by manipulating the delay separating the presentation of the target intervals and the moment when the reproduction of one of these began. Results show that there is more variance when only visually marked intervals are presented, and this effect is exacerbated with longer retention delays. Finally, when there is only one interval to process, encoding the interval with signals delivered from two modalities helps to reduce variance. Taken together, these results suggest that the hypothesis stating that there are sensory-specific clock components and memory mechanisms is viable.

  19. A scalable parallel algorithm for multiple objective linear programs

    Science.gov (United States)

    Wiecek, Malgorzata M.; Zhang, Hong

    1994-01-01

    This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.

  20. MPI_XSTAR: MPI-based parallelization of XSTAR program

    Science.gov (United States)

    Danehkar, A.

    2017-12-01

    MPI_XSTAR parallelizes execution of multiple XSTAR runs using Message Passing Interface (MPI). XSTAR (ascl:9910.008), part of the HEASARC's HEAsoft (ascl:1408.004) package, calculates the physical conditions and emission spectra of ionized gases. MPI_XSTAR invokes XSTINITABLE from HEASoft to generate a job list of XSTAR commands for given physical parameters. The job list is used to make directories in ascending order, where each individual XSTAR is spawned on each processor and outputs are saved. HEASoft's XSTAR2TABLE program is invoked upon the contents of each directory in order to produce table model FITS files for spectroscopy analysis tools.

  1. Memory Vulnerability Diagnosis for Binary Program

    Directory of Open Access Journals (Sweden)

    Tang Feng-Yi

    2016-01-01

    Full Text Available Vulnerability diagnosis is important for program security analysis. It is a further step to understand the vulnerability after it is detected, as well as a preparatory step for vulnerability repair or exploitation. This paper mainly analyses the inner theories of major memory vulnerabilities and the threats of them. And then suggests some methods to diagnose several types of memory vulnerabilities for the binary programs, which is a difficult task due to the lack of source code. The diagnosis methods target at buffer overflow, use after free (UAF and format string vulnerabilities. We carried out some tests on the Linux platform to validate the effectiveness of the diagnosis methods. It is proved that the methods can judge the type of the vulnerability given a binary program.

  2. A program system for ab initio MO calculations on vector and parallel processing machines. Pt. 3

    International Nuclear Information System (INIS)

    Wiest, R.; Demuynck, J.; Benard, M.; Rohmer, M.M.; Ernenwein, R.

    1991-01-01

    This series of three papers presents a program system for ab initio molecular orbital calculations on vector and parallel computers. Part III is devoted to the four-index transformation on a molecular orbital basis of size NMO of the file of two-electorn integrals (pqparallelrs) generated by a contracted Gaussian set of size NATO (number of atomic orbitals). A fast Yoshimine algorithm first sorts the (pqparallelrs) integrals with respect to index pq only. This file of half-sorted integrals labelled by their rs-index can be processed without further modification to generate either the transformed integrals or the supermatrix elements. The large memory available on the CRAY-2 hase made possible to implement the transformation algorithm proposed by Bender in 1972, which requires a core-storage allocation varying as (NATO) 3 . Two versions of Bender's algorithm are included in the present program. The first version is an in-core version, where the complete file of accumulated contributions to transformed integrals in stored and updated in central memory. This version has been parallelized by distributing over a limited number of logical tasks the NATO steps corresponding to the scanning of the most external loop. The second version is an out-of-core version, in which twin files are alternatively used as input and output for the accumulated contributions to transformed integrals. This version is not parallel. The choice of one or another version and (for version 1) the determination of the number of tasks depends upon the balance between the available and the requested amounts of storage. The storage management and the choice of the proper version are carried out automatically using dynamic storage allocation. Both versions are vectorized and take advantage of the molecular symmetry. (orig.)

  3. An object-oriented programming paradigm for parallelization of computational fluid dynamics

    International Nuclear Information System (INIS)

    Ohta, Takashi.

    1997-03-01

    We propose an object-oriented programming paradigm for parallelization of scientific computing programs, and show that the approach can be a very useful strategy. Generally, parallelization of scientific programs tends to be complicated and unportable due to the specific requirements of each parallel computer or compiler. In this paper, we show that the object-oriented programming design, which separates the parallel processing parts from the solver of the applications, can achieve the large improvement in the maintenance of the codes, as well as the high portability. We design the program for the two-dimensional Euler equations according to the paradigm, and evaluate the parallel performance on IBM SP2. (author)

  4. Enhancing Application Performance Using Mini-Apps: Comparison of Hybrid Parallel Programming Paradigms

    Science.gov (United States)

    Lawson, Gary; Sosonkina, Masha; Baurle, Robert; Hammond, Dana

    2017-01-01

    In many fields, real-world applications for High Performance Computing have already been developed. For these applications to stay up-to-date, new parallel strategies must be explored to yield the best performance; however, restructuring or modifying a real-world application may be daunting depending on the size of the code. In this case, a mini-app may be employed to quickly explore such options without modifying the entire code. In this work, several mini-apps have been created to enhance a real-world application performance, namely the VULCAN code for complex flow analysis developed at the NASA Langley Research Center. These mini-apps explore hybrid parallel programming paradigms with Message Passing Interface (MPI) for distributed memory access and either Shared MPI (SMPI) or OpenMP for shared memory accesses. Performance testing shows that MPI+SMPI yields the best execution performance, while requiring the largest number of code changes. A maximum speedup of 23 was measured for MPI+SMPI, but only 11 was measured for MPI+OpenMP.

  5. Mobile and replicated alignment of arrays in data-parallel programs

    Science.gov (United States)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  6. A learnable parallel processing architecture towards unity of memory and computing.

    Science.gov (United States)

    Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J

    2015-08-14

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.

  7. Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems

    KAUST Repository

    Mudigere, Dheevatsa

    2015-05-01

    In this work, we revisit the 1999 Gordon Bell Prize winning PETSc-FUN3D aerodynamics code, extending it with highly-tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank sub domain, and the total number of domains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up in performance on a single-node Intel® XeonTM1 E5 2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over baseline implementation as we scale up to 256 nodes.

  8. A learnable parallel processing architecture towards unity of memory and computing

    Science.gov (United States)

    Li, H.; Gao, B.; Chen, Z.; Zhao, Y.; Huang, P.; Ye, H.; Liu, L.; Liu, X.; Kang, J.

    2015-08-01

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need of efficient information processing for the data-driven applications such as big data and Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named “iMemComp”, where memory and logic are unified with single-type devices. Leveraging nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped “iMemComp” with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such architecture eliminates the energy-hungry data movement in von Neumann computers. Compared with contemporary silicon technology, adder circuits based on “iMemComp” can improve the speed by 76.8% and the power dissipation by 60.3%, together with a 700 times aggressive reduction in the circuit area.

  9. Studies of electron collisions with polyatomic molecules using distributed-memory parallel computers

    International Nuclear Information System (INIS)

    Winstead, C.; Hipes, P.G.; Lima, M.A.P.; McKoy, V.

    1991-01-01

    Elastic electron scattering cross sections from 5--30 eV are reported for the molecules C 2 H 4 , C 2 H 6 , C 3 H 8 , Si 2 H 6 , and GeH 4 , obtained using an implementation of the Schwinger multichannel method for distributed-memory parallel computer architectures. These results, obtained within the static-exchange approximation, are in generally good agreement with the available experimental data. These calculations demonstrate the potential of highly parallel computation in the study of collisions between low-energy electrons and polyatomic gases. The computational methodology discussed is also directly applicable to the calculation of elastic cross sections at higher levels of approximation (target polarization) and of electronic excitation cross sections

  10. An object-oriented bulk synchronous parallel library for multicore programming

    NARCIS (Netherlands)

    Yzelman, A.N.; Bisseling, R.H.

    2012-01-01

    We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has

  11. A Tool for Performance Modeling of Parallel Programs

    Directory of Open Access Journals (Sweden)

    J.A. González

    2003-01-01

    Full Text Available Current performance prediction analytical models try to characterize the performance behavior of actual machines through a small set of parameters. In practice, substantial deviations are observed. These differences are due to factors as memory hierarchies or network latency. A natural approach is to associate a different proportionality constant with each basic block, and analogously, to associate different latencies and bandwidths with each "communication block". Unfortunately, to use this approach implies that the evaluation of parameters must be done for each algorithm. This is a heavy task, implying experiment design, timing, statistics, pattern recognition and multi-parameter fitting algorithms. Software support is required. We present a compiler that takes as source a C program annotated with complexity formulas and produces as output an instrumented code. The trace files obtained from the execution of the resulting code are analyzed with an interactive interpreter, giving us, among other information, the values of those parameters.

  12. A program system for ab initio MO calculations on vector and parallel processing machines. Pt. 1

    International Nuclear Information System (INIS)

    Ernenwein, R.; Rohmer, M.M.; Benard, M.

    1990-01-01

    We present a program system for ab initio molecular orbital calculations on vector and parallel computers. The present article is devoted to the computation of one- and two-electron integrals over contracted Gaussian basis sets involving s-, p-, d- and f-type functions. The McMurchie and Davidson (MMD) algorithm has been implemented and parallelized by distributing over a limited number of logical tasks the calculation of the 55 relevant classes of integrals. All sections of the MMD algorithm have been efficiently vectorized, leading to a scalar/vector ratio of 5.8. Different algorithms are proposed and compared for an optimal vectorization of the contraction of the 'intermediate integrals' generated by the MMD formalism. Advantage is taken of the dynamic storage allocation for tuning the length of the vector loops (i.e. the size of the vectorization buffer) as a function of (i) the total memory available for the job, (ii) the number of logical tasks defined by the user (≤13), and (iii) the storage requested by each specific class of integrals. Test calculations carried out on a CRAY-2 computer show that the average number of finite integrals computed over a (s, p, d, f) CGTO basis set is about 1180000 per second and per processor. The combination of vectorization and parallelism on this 4-processor machine reduces the CPU time by a factor larger than 20 with respect to the scalar and sequential performance. (orig.)

  13. Parallel Conjugate Gradient: Effects of Ordering Strategies, Programming Paradigms, and Architectural Platforms

    Science.gov (United States)

    Oliker, Leonid; Heber, Gerd; Biswas, Rupak

    2000-01-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique to solve sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message passing performance using shared memory constructs through careful data ordering and distribution. However, a multi-threaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.

  14. Massive parallel electromagnetic field simulation program JEMS-FDTD design and implementation on jasmin

    International Nuclear Information System (INIS)

    Li Hanyu; Zhou Haijing; Dong Zhiwei; Liao Cheng; Chang Lei; Cao Xiaolin; Xiao Li

    2010-01-01

    A large-scale parallel electromagnetic field simulation program JEMS-FDTD(J Electromagnetic Solver-Finite Difference Time Domain) is designed and implemented on JASMIN (J parallel Adaptive Structured Mesh applications INfrastructure). This program can simulate propagation, radiation, couple of electromagnetic field by solving Maxwell equations on structured mesh explicitly with FDTD method. JEMS-FDTD is able to simulate billion-mesh-scale problems on thousands of processors. In this article, the program is verified by simulating the radiation of an electric dipole. A beam waveguide is simulated to demonstrate the capability of large scale parallel computation. A parallel performance test indicates that a high parallel efficiency is obtained. (authors)

  15. Parallel database search and prime factorization with magnonic holographic memory devices

    Energy Technology Data Exchange (ETDEWEB)

    Khitun, Alexander [Electrical and Computer Engineering Department, University of California - Riverside, Riverside, California 92521 (United States)

    2015-12-28

    In this work, we describe the capabilities of Magnonic Holographic Memory (MHM) for parallel database search and prime factorization. MHM is a type of holographic device, which utilizes spin waves for data transfer and processing. Its operation is based on the correlation between the phases and the amplitudes of the input spin waves and the output inductive voltage. The input of MHM is provided by the phased array of spin wave generating elements allowing the producing of phase patterns of an arbitrary form. The latter makes it possible to code logic states into the phases of propagating waves and exploit wave superposition for parallel data processing. We present the results of numerical modeling illustrating parallel database search and prime factorization. The results of numerical simulations on the database search are in agreement with the available experimental data. The use of classical wave interference may results in a significant speedup over the conventional digital logic circuits in special task data processing (e.g., √n in database search). Potentially, magnonic holographic devices can be implemented as complementary logic units to digital processors. Physical limitations and technological constrains of the spin wave approach are also discussed.

  16. Parallel database search and prime factorization with magnonic holographic memory devices

    Science.gov (United States)

    Khitun, Alexander

    2015-12-01

    In this work, we describe the capabilities of Magnonic Holographic Memory (MHM) for parallel database search and prime factorization. MHM is a type of holographic device, which utilizes spin waves for data transfer and processing. Its operation is based on the correlation between the phases and the amplitudes of the input spin waves and the output inductive voltage. The input of MHM is provided by the phased array of spin wave generating elements allowing the producing of phase patterns of an arbitrary form. The latter makes it possible to code logic states into the phases of propagating waves and exploit wave superposition for parallel data processing. We present the results of numerical modeling illustrating parallel database search and prime factorization. The results of numerical simulations on the database search are in agreement with the available experimental data. The use of classical wave interference may results in a significant speedup over the conventional digital logic circuits in special task data processing (e.g., √n in database search). Potentially, magnonic holographic devices can be implemented as complementary logic units to digital processors. Physical limitations and technological constrains of the spin wave approach are also discussed.

  17. Parallel database search and prime factorization with magnonic holographic memory devices

    International Nuclear Information System (INIS)

    Khitun, Alexander

    2015-01-01

    In this work, we describe the capabilities of Magnonic Holographic Memory (MHM) for parallel database search and prime factorization. MHM is a type of holographic device, which utilizes spin waves for data transfer and processing. Its operation is based on the correlation between the phases and the amplitudes of the input spin waves and the output inductive voltage. The input of MHM is provided by the phased array of spin wave generating elements allowing the producing of phase patterns of an arbitrary form. The latter makes it possible to code logic states into the phases of propagating waves and exploit wave superposition for parallel data processing. We present the results of numerical modeling illustrating parallel database search and prime factorization. The results of numerical simulations on the database search are in agreement with the available experimental data. The use of classical wave interference may results in a significant speedup over the conventional digital logic circuits in special task data processing (e.g., √n in database search). Potentially, magnonic holographic devices can be implemented as complementary logic units to digital processors. Physical limitations and technological constrains of the spin wave approach are also discussed

  18. Exploiting variability for energy optimization of parallel programs

    Energy Technology Data Exchange (ETDEWEB)

    Lavrijsen, Wim [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Iancu, Costin [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); de Jong, Wibe [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Chen, Xin [Georgia Inst. of Technology, Atlanta, GA (United States); Schwan, Karsten [Georgia Inst. of Technology, Atlanta, GA (United States)

    2016-04-18

    Here in this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage in scientific applications. Our main insight is that noise is intrinsic to large scale parallel executions and it appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared to previous energy optimizations that make per core decisions using predictions of the running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on MPI two-sided and offline analysis, capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.

  19. PAREMD: A parallel program for the evaluation of momentum space properties of atoms and molecules

    Science.gov (United States)

    Meena, Deep Raj; Gadre, Shridhar R.; Balanarayan, P.

    2018-03-01

    The present work describes a code for evaluating the electron momentum density (EMD), its moments and the associated Shannon information entropy for a multi-electron molecular system. The code works specifically for electronic wave functions obtained from traditional electronic structure packages such as GAMESS and GAUSSIAN. For the momentum space orbitals, the general expression for Gaussian basis sets in position space is analytically Fourier transformed to momentum space Gaussian basis functions. The molecular orbital coefficients of the wave function are taken as an input from the output file of the electronic structure calculation. The analytic expressions of EMD are evaluated over a fine grid and the accuracy of the code is verified by a normalization check and a numerical kinetic energy evaluation which is compared with the analytic kinetic energy given by the electronic structure package. Apart from electron momentum density, electron density in position space has also been integrated into this package. The program is written in C++ and is executed through a Shell script. It is also tuned for multicore machines with shared memory through OpenMP. The program has been tested for a variety of molecules and correlated methods such as CISD, Møller-Plesset second order (MP2) theory and density functional methods. For correlated methods, the PAREMD program uses natural spin orbitals as an input. The program has been benchmarked for a variety of Gaussian basis sets for different molecules showing a linear speedup on a parallel architecture.

  20. Parallel main-memory indexing for moving-object query and update workloads

    DEFF Research Database (Denmark)

    Sidlauskas, Darius; Saltenis, Simonas; Jensen, Christian Søndergaard

    2012-01-01

    of supporting the location-related query and update workloads generated by very large populations of such moving objects. This paper presents a main-memory indexing technique that aims to support such workloads. The technique, called PGrid, uses a grid structure that is capable of exploiting the parallelism...... offered by modern processors. Unlike earlier proposals that maintain separate structures for updates and queries, PGrid allows both long-running queries and rapid updates to operate on a single data structure and thus offers up-to-date query results. Because PGrid does not rely on creating snapshots...... on the same current data-store state, PGrid outperforms snapshot-based techniques in terms of both query freshness and CPU cycle-wise efficiency....

  1. The numerical parallel computing of photon transport

    International Nuclear Information System (INIS)

    Huang Qingnan; Liang Xiaoguang; Zhang Lifa

    1998-12-01

    The parallel computing of photon transport is investigated, the parallel algorithm and the parallelization of programs on parallel computers both with shared memory and with distributed memory are discussed. By analyzing the inherent law of the mathematics and physics model of photon transport according to the structure feature of parallel computers, using the strategy of 'to divide and conquer', adjusting the algorithm structure of the program, dissolving the data relationship, finding parallel liable ingredients and creating large grain parallel subtasks, the sequential computing of photon transport into is efficiently transformed into parallel and vector computing. The program was run on various HP parallel computers such as the HY-1 (PVP), the Challenge (SMP) and the YH-3 (MPP) and very good parallel speedup has been gotten

  2. Expert Programmer versus Parallelizing Compiler: A Comparative Study of Two Approaches for Distributed Shared Memory

    Directory of Open Access Journals (Sweden)

    M. F. P. O'Boyle

    1996-01-01

    Full Text Available This article critically examines current parallel programming practice and optimizing compiler development. The general strategies employed by compiler and programmer to optimize a Fortran program are described, and then illustrated for a specific case by applying them to a well-known scientific program, TRED2, using the KSR-1 as the target architecture. Extensive measurement is applied to the resulting versions of the program, which are compared with a version produced by a commercial optimizing compiler, KAP. The compiler strategy significantly outperforms KAP and does not fall far short of the performance achieved by the programmer. Following the experimental section each approach is critiqued by the other. Perceived flaws, advantages, and common ground are outlined, with an eye to improving both schemes.

  3. MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

    Science.gov (United States)

    González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

    2016-12-15

    MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net CONTACT: jgonzalezd@udc.esSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  4. The Medial Temporal Lobe – Conduit of Parallel Connectivity: A model for Attention, Memory, and Perception.

    Directory of Open Access Journals (Sweden)

    Brian B. Mozaffari

    2014-11-01

    Full Text Available Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL – located deep in the hierarchy – serves as a bridge connecting supra to infra – MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick complex visual phenomenon. Bypassing intermediate hierarchy levels, information conveyed through the MTL ‘bridge’ allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these ‘bridge’ predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC. In this scheme, consolidation is considered as a secondary process, occurring after a MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid processing transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation.

  5. Hardware system of parallel processing for fast CT image reconstruction based on circular shifting float memory architecture

    International Nuclear Information System (INIS)

    Wang Shi; Kang Kejun; Wang Jingjin

    1995-01-01

    Computerized Tomography (CT) is expected to become an inevitable diagnostic technique in the future. However, the long time required to reconstruct an image has been one of the major drawbacks associated with this technique. Parallel process is one of the best way to solve this problem. This paper gives the architecture and hardware design of PIRS-4 (4-processor Parallel Image Reconstruction System) which is a parallel processing system for fast 3D-CT image reconstruction by circular shifting float memory architecture. It includes structure and component of the system, the design of cross bar switch and details of control model. The test results are described

  6. Kemari: A Portable High Performance Fortran System for Distributed Memory Parallel Processors

    Directory of Open Access Journals (Sweden)

    T. Kamachi

    1997-01-01

    Full Text Available We have developed a compilation system which extends High Performance Fortran (HPF in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives give both additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of message-passing interface (MPI primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.

  7. On the interplay between working memory consolidation and attentional selection in controlling conscious access : Parallel processing at a cost-a comment on 'The interplay of attention and consciousness in visual search, attentional blink and working memory consolidation'

    NARCIS (Netherlands)

    Wyble, Brad; Bowman, Howard; Nieuwenstein, Mark

    On the interplay between working memory consolidation and attentional selection in controlling conscious access: parallel processing at a cost-a comment on 'The interplay of attention and consciousness in visual search, attentional blink and working memory consolidation'

  8. Effects of a Memory Training Program in Older People with Severe Memory Loss

    Science.gov (United States)

    Mateos, Pedro M.; Valentin, Alberto; González-Tablas, Maria del Mar; Espadas, Verónica; Vera, Juan L.; Jorge, Inmaculada García

    2016-01-01

    Strategies based memory training programs are widely used to enhance the cognitive abilities of the elderly. Participants in these training programs are usually people whose mental abilities remain intact. Occasionally, people with cognitive impairment also participate. The aim of this study was to test if memory training designed specifically for…

  9. MPI_XSTAR: MPI-based Parallelization of the XSTAR Photoionization Program

    Science.gov (United States)

    Danehkar, Ashkbiz; Nowak, Michael A.; Lee, Julia C.; Smith, Randall K.

    2018-02-01

    We describe a program for the parallel implementation of multiple runs of XSTAR, a photoionization code that is used to predict the physical properties of an ionized gas from its emission and/or absorption lines. The parallelization program, called MPI_XSTAR, has been developed and implemented in the C++ language by using the Message Passing Interface (MPI) protocol, a conventional standard of parallel computing. We have benchmarked parallel multiprocessing executions of XSTAR, using MPI_XSTAR, against a serial execution of XSTAR, in terms of the parallelization speedup and the computing resource efficiency. Our experience indicates that the parallel execution runs significantly faster than the serial execution, however, the efficiency in terms of the computing resource usage decreases with increasing the number of processors used in the parallel computing.

  10. Parallel Libraries to support High-Level Programming

    DEFF Research Database (Denmark)

    Larsen, Morten Nørgaard

    and the Microsoft .NET iv framework. Normally, one would not directly think of the .NET framework when talking scientific applications, but Microsoft has in the last couple of versions of .NET introduce a number of tools for writing parallel and high performance code. The first section examines how programmers can...

  11. On the effective parallel programming of multi-core processors

    NARCIS (Netherlands)

    Varbanescu, A.L.

    2010-01-01

    Multi-core processors are considered now the only feasible alternative to the large single-core processors which have become limited by technological aspects such as power consumption and heat dissipation. However, due to their inherent parallel structure and their diversity, multi-cores are

  12. HPC parallel programming model for gyrokinetic MHD simulation

    International Nuclear Information System (INIS)

    Naitou, Hiroshi; Yamada, Yusuke; Tokuda, Shinji; Ishii, Yasutomo; Yagi, Masatoshi

    2011-01-01

    The 3-dimensional gyrokinetic PIC (particle-in-cell) code for MHD simulation, Gpic-MHD, was installed on SR16000 (“Plasma Simulator”), which is a scalar cluster system consisting of 8,192 logical cores. The Gpic-MHD code advances particle and field quantities in time. In order to distribute calculations over large number of logical cores, the total simulation domain in cylindrical geometry was broken up into N DD-r × N DD-z (number of radial decomposition times number of axial decomposition) small domains including approximately the same number of particles. The axial direction was uniformly decomposed, while the radial direction was non-uniformly decomposed. N RP replicas (copies) of each decomposed domain were used (“particle decomposition”). The hybrid parallelization model of multi-threads and multi-processes was employed: threads were parallelized by the auto-parallelization and N DD-r × N DD-z × N RP processes were parallelized by MPI (message-passing interface). The parallelization performance of Gpic-MHD was investigated for the medium size system of N r × N θ × N z = 1025 × 128 × 128 mesh with 4.196 or 8.192 billion particles. The highest speed for the fixed number of logical cores was obtained for two threads, the maximum number of N DD-z , and optimum combination of N DD-r and N RP . The observed optimum speeds demonstrated good scaling up to 8,192 logical cores. (author)

  13. Overview of the Force Scientific Parallel Language

    Directory of Open Access Journals (Sweden)

    Gita Alaghband

    1994-01-01

    Full Text Available The Force parallel programming language designed for large-scale shared-memory multiprocessors is presented. The language provides a number of parallel constructs as extensions to the ordinary Fortran language and is implemented as a two-level macro preprocessor to support portability across shared memory multiprocessors. The global parallelism model on which the Force is based provides a powerful parallel language. The parallel constructs, generic synchronization, and freedom from process management supported by the Force has resulted in structured parallel programs that are ported to the many multiprocessors on which the Force is implemented. Two new parallel constructs for looping and functional decomposition are discussed. Several programming examples to illustrate some parallel programming approaches using the Force are also presented.

  14. Adapting high-level language programs for parallel processing using data flow

    Science.gov (United States)

    Standley, Hilda M.

    1988-01-01

    EASY-FLOW, a very high-level data flow language, is introduced for the purpose of adapting programs written in a conventional high-level language to a parallel environment. The level of parallelism provided is of the large-grained variety in which parallel activities take place between subprograms or processes. A program written in EASY-FLOW is a set of subprogram calls as units, structured by iteration, branching, and distribution constructs. A data flow graph may be deduced from an EASY-FLOW program.

  15. Communications oriented programming of parallel iterative solutions of sparse linear systems

    Science.gov (United States)

    Patrick, M. L.; Pratt, T. W.

    1986-01-01

    Parallel algorithms are developed for a class of scientific computational problems by partitioning the problems into smaller problems which may be solved concurrently. The effectiveness of the resulting parallel solutions is determined by the amount and frequency of communication and synchronization and the extent to which communication can be overlapped with computation. Three different parallel algorithms for solving the same class of problems are presented, and their effectiveness is analyzed from this point of view. The algorithms are programmed using a new programming environment. Run-time statistics and experience obtained from the execution of these programs assist in measuring the effectiveness of these algorithms.

  16. Compiler and Runtime Support for Programming in Adaptive Parallel Environments

    Science.gov (United States)

    1998-10-15

    noother job is waiting for resources, and use a smaller number of processors when other jobs needresources. Setia et al. [15, 20] have shown that such...15] Vijay K. Naik, Sanjeev Setia , and Mark Squillante. Performance analysis of job scheduling policiesin parallel supercomputing environments. In...on networks ofheterogeneous workstations. Technical Report CSE-94-012, Oregon Graduate Institute of Scienceand Technology, 1994.[20] Sanjeev Setia

  17. Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model

    KAUST Repository

    Hamam, Alwaleed A.

    2017-03-13

    Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it\\'s time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.

  18. Exploration Of Deep Learning Algorithms Using Openacc Parallel Programming Model

    KAUST Repository

    Hamam, Alwaleed A.; Khan, Ayaz H.

    2017-01-01

    Deep learning is based on a set of algorithms that attempt to model high level abstractions in data. Specifically, RBM is a deep learning algorithm that used in the project to increase it's time performance using some efficient parallel implementation by OpenACC tool with best possible optimizations on RBM to harness the massively parallel power of NVIDIA GPUs. GPUs development in the last few years has contributed to growing the concept of deep learning. OpenACC is a directive based ap-proach for computing where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Ma-chine is a stochastic neural network that essentially perform a binary version of factor analysis. RBM is a useful neural net-work basis for larger modern deep learning model, such as Deep Belief Network. RBM parameters are estimated using an efficient training method that called Contrastive Divergence. Parallel implementation of RBM is available using different models such as OpenMP, and CUDA. But this project has been the first attempt to apply OpenACC model on RBM.

  19. Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    Science.gov (United States)

    Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2012-01-10

    The present in invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  20. Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    Science.gov (United States)

    Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2008-01-01

    The present in invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  1. User's guide of parallel program development environment (PPDE). The 2nd edition

    International Nuclear Information System (INIS)

    Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio; Ohta, Hirofumi

    2000-03-01

    The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a parallelizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)

  2. Sleep Benefits in Parallel Implicit and Explicit Measures of Episodic Memory

    Science.gov (United States)

    Weber, Frederik D.; Wang, Jing-Yi; Born, Jan; Inostroza, Marion

    2014-01-01

    Research in rats using preferences during exploration as a measure of memory has indicated that sleep is important for the consolidation of episodic-like memory, i.e., memory for an event bound into specific spatio-temporal context. How these findings relate to human episodic memory is unclear. We used spontaneous preferences during visual…

  3. ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems.

    Science.gov (United States)

    González-Domínguez, Jorge; Expósito, Roberto R

    2018-01-01

    Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search of interesting biclusters on binary datasets, which are very popular on different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially on scenarios that result on many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.

  4. Protocol-Based Verification of Message-Passing Parallel Programs

    DEFF Research Database (Denmark)

    López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco

    2015-01-01

    We present ParTypes, a type-based methodology for the verification of Message Passing Interface (MPI) programs written in the C programming language. The aim is to statically verify programs against protocol specifications, enforcing properties such as fidelity and absence of deadlocks. We develo...

  5. Feasibility studies for a high energy physics MC program on massive parallel platforms

    International Nuclear Information System (INIS)

    Bertolotto, L.M.; Peach, K.J.; Apostolakis, J.; Bruschini, C.E.; Calafiura, P.; Gagliardi, F.; Metcalf, M.; Norton, A.; Panzer-Steindel, B.

    1994-01-01

    The parallelization of a Monte Carlo program for the NA48 experiment is presented. As a first step, a task farming structure was realized. Based on this, a further step, making use of a distributed database for showers in the electro-magnetic calorimeter, was implemented. Further possibilities for using parallel processing for a quasi-real time calibration of the calorimeter are described

  6. Cell verification of parallel burnup calculation program MCBMPI based on MPI

    International Nuclear Information System (INIS)

    Yang Wankui; Liu Yaoguang; Ma Jimin; Wang Guanbo; Yang Xin; She Ding

    2014-01-01

    The parallel burnup calculation program MCBMPI was developed. The program was modularized. The parallel MCNP5 program MCNP5MPI was employed as neutron transport calculation module. And a composite of three solution methods was used to solve burnup equation, i.e. matrix exponential technique, TTA analytical solution, and Gauss Seidel iteration. MPI parallel zone decomposition strategy was concluded in the program. The program system only consists of MCNP5MPI and burnup subroutine. The latter achieves three main functions, i.e. zone decomposition, nuclide transferring and decaying, and data exchanging with MCNP5MPI. Also, the program was verified with the pressurized water reactor (PWR) cell burnup benchmark. The results show that it,s capable to apply the program to burnup calculation of multiple zones, and the computation efficiency could be significantly improved with the development of computer hardware. (authors)

  7. 76 FR 62808 - Pilot Program for Parallel Review of Medical Products

    Science.gov (United States)

    2011-10-11

    ... voluntary participation in the pilot program, as well as the guiding principles the Agencies intend to... 57045), parallel review is intended to reduce the time between FDA marketing approval and CMS national...

  8. F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming

    Science.gov (United States)

    DiNucci, David C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    Parallel programming is still being based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called Soviets which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g. data abstraction, data parallelism, and object-based programming constructs).

  9. The technical feasibility of uranium enrichment for nuclear bomb construction at the parallel nuclear program plant

    International Nuclear Information System (INIS)

    Rosa, L.P.

    1990-01-01

    It is discussed the hole of the Parallel Nuclear Program is Brazil and the feasibility of uranium enrichment for nuclear bomb construction. This program involves two research centers, one belonging to the brazilian navy and another to the aeronautics. Some other brazilian institutes like CTA, IPEN, COPESP and CETEX and also taking part in the program. (A.C.A.S.)

  10. Memory for radio advertisements: the effect of program and typicality.

    Science.gov (United States)

    Martín-Luengo, Beatriz; Luna, Karlos; Migueles, Malen

    2013-01-01

    We examined the influence of the type of radio program on the memory for radio advertisements. We also investigated the role in memory of the typicality (high or low) of the elements of the products advertised. Participants listened to three types of programs (interesting, boring, enjoyable) with two advertisements embedded in each. After completing a filler task, the participants performed a true/false recognition test. Hits and false alarm rates were higher for the interesting and enjoyable programs than for the boring one. There were also more hits and false alarms for the high-typicality elements. The response criterion for the advertisements embedded in the boring program was stricter than for the advertisements in other types of programs. We conclude that the type of program in which an advertisement is inserted and the nature of the elements of the advertisement affect both the number of hits and false alarms and the response criterion, but not the accuracy of the memory.

  11. Development and benchmark verification of a parallelized Monte Carlo burnup calculation program MCBMPI

    International Nuclear Information System (INIS)

    Yang Wankui; Liu Yaoguang; Ma Jimin; Yang Xin; Wang Guanbo

    2014-01-01

    MCBMPI, a parallelized burnup calculation program, was developed. The program is modularized. Neutron transport calculation module employs the parallelized MCNP5 program MCNP5MPI, and burnup calculation module employs ORIGEN2, with the MPI parallel zone decomposition strategy. The program system only consists of MCNP5MPI and an interface subroutine. The interface subroutine achieves three main functions, i.e. zone decomposition, nuclide transferring and decaying, data exchanging with MCNP5MPI. Also, the program was verified with the Pressurized Water Reactor (PWR) cell burnup benchmark, the results showed that it's capable to apply the program to burnup calculation of multiple zones, and the computation efficiency could be significantly improved with the development of computer hardware. (authors)

  12. Vdebug: debugging tool for parallel scientific programs. Design report on vdebug

    International Nuclear Information System (INIS)

    Matsuda, Katsuyuki; Takemiya, Hiroshi

    2000-02-01

    We report on a debugging tool called vdebug which supports debugging work for parallel scientific simulation programs. It is difficult to debug scientific programs with an existing debugger, because the volume of data generated by the programs is too large for users to check data in characters. Usually, the existing debugger shows data values in characters. To alleviate it, we have developed vdebug which enables to check the validity of large amounts of data by showing these data values visually. Although targets of vdebug have been restricted to sequential programs, we have made it applicable to parallel programs by realizing the function of merging and visualizing data distributed on programs on each computer node. Now, vdebug works on seven kinds of parallel computers. In this report, we describe the design of vdebug. (author)

  13. User's guide of parallel program development environment (PPDE). The 2nd edition

    Energy Technology Data Exchange (ETDEWEB)

    Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio [Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokyo (Japan); Ohta, Hirofumi [Hitachi Ltd., Tokyo (Japan)

    2000-03-01

    The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a paralleilizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)

  14. User's guide of parallel program development environment (PPDE). The 2nd edition

    Energy Technology Data Exchange (ETDEWEB)

    Ueno, Hirokazu; Takemiya, Hiroshi; Imamura, Toshiyuki; Koide, Hiroshi; Matsuda, Katsuyuki; Higuchi, Kenji; Hirayama, Toshio [Center for Promotion of Computational Science and Engineering, Japan Atomic Energy Research Institute, Tokyo (Japan); Ohta, Hirofumi [Hitachi Ltd., Tokyo (Japan)

    2000-03-01

    The STA basic system has been enhanced to accelerate support for parallel programming on heterogeneous parallel computers, through a series of R and D on the technology of parallel processing. The enhancement has been made through extending the function of the PPDF, Parallel Program Development Environment in the STA basic system. The extended PPDE has the function to make: 1) the automatic creation of a 'makefile' and a shell script file for its execution, 2) the multi-tools execution which makes the tools on heterogeneous computers to execute with one operation a task on a computer, and 3) the mirror composition to reflect editing results of a file on a computer into all related files on other computers. These additional functions will enhance the work efficiency for program development on some computers. More functions have been added to the PPDE to provide help for parallel program development. New functions were also designed to complement a HPF translator and a paralleilizing support tool when working together so that a sequential program is efficiently converted to a parallel program. This report describes the use of extended PPDE. (author)

  15. Time complexity analysis for distributed memory computers: implementation of parallel conjugate gradient method

    NARCIS (Netherlands)

    Hoekstra, A.G.; Sloot, P.M.A.; Haan, M.J.; Hertzberger, L.O.; van Leeuwen, J.

    1991-01-01

    New developments in Computer Science, both hardware and software, offer researchers, such as physicists, unprecedented possibilities to solve their computational intensive problems.However, full exploitation of e.g. new massively parallel computers, parallel languages or runtime environments

  16. Parallel statistical image reconstruction for cone-beam x-ray CT on a shared memory computation platform

    International Nuclear Information System (INIS)

    Kole, J S; Beekman, F J

    2005-01-01

    Statistical reconstruction methods offer possibilities of improving image quality as compared to analytical methods, but current reconstruction times prohibit routine clinical applications. To reduce reconstruction times we have parallelized a statistical reconstruction algorithm for cone-beam x-ray CT, the ordered subset convex algorithm (OSC), and evaluated it on a shared memory computer. Two different parallelization strategies were developed: one that employs parallelism by computing the work for all projections within a subset in parallel, and one that divides the total volume into parts and processes the work for each sub-volume in parallel. Both methods are used to reconstruct a three-dimensional mathematical phantom on two different grid densities. The reconstructed images are binary identical to the result of the serial (non-parallelized) algorithm. The speed-up factor equals approximately 30 when using 32 to 40 processors, and scales almost linearly with the number of cpus for both methods. The huge reduction in computation time allows us to apply statistical reconstruction to clinically relevant studies for the first time

  17. On the Performance of the Python Programming Language for Serial and Parallel Scientific Computations

    Directory of Open Access Journals (Sweden)

    Xing Cai

    2005-01-01

    Full Text Available This article addresses the performance of scientific applications that use the Python programming language. First, we investigate several techniques for improving the computational efficiency of serial Python codes. Then, we discuss the basic programming techniques in Python for parallelizing serial scientific applications. It is shown that an efficient implementation of the array-related operations is essential for achieving good parallel performance, as for the serial case. Once the array-related operations are efficiently implemented, probably using a mixed-language implementation, good serial and parallel performance become achievable. This is confirmed by a set of numerical experiments. Python is also shown to be well suited for writing high-level parallel programs.

  18. Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems.

    Science.gov (United States)

    Pan, Tony; Flick, Patrick; Jain, Chirag; Liu, Yongchao; Aluru, Srinivas

    2017-10-09

    Counting and indexing fixed length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases every 3 days. We present Kmerind, a high performance parallel k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's k-mer counter performs similarly or better than the best existing k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1% of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first k-mer indexing library for distributed memory environments, and the first extensible library for general k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.

  19. Process-Oriented Parallel Programming with an Application to Data-Intensive Computing

    OpenAIRE

    Givelberg, Edward

    2014-01-01

    We introduce process-oriented programming as a natural extension of object-oriented programming for parallel computing. It is based on the observation that every class of an object-oriented language can be instantiated as a process, accessible via a remote pointer. The introduction of process pointers requires no syntax extension, identifies processes with programming objects, and enables processes to exchange information simply by executing remote methods. Process-oriented programming is a h...

  20. Development of parallel 3D discrete ordinates transport program on JASMIN framework

    International Nuclear Information System (INIS)

    Cheng, T.; Wei, J.; Shen, H.; Zhong, B.; Deng, L.

    2015-01-01

    A parallel 3D discrete ordinates radiation transport code JSNT-S is developed, aiming at simulating real-world radiation shielding and reactor physics applications in a reasonable time. Through the patch-based domain partition algorithm, the memory requirement is shared among processors and a space-angle parallel sweeping algorithm is developed based on data-driven algorithm. Acceleration methods such as partial current rebalance are implemented. The correctness is proved through the VENUS-3 and other benchmark models. In the radiation shielding calculation of the Qinshan-II reactor pressure vessel model with 24.3 billion DoF, only 88 seconds is required and the overall parallel efficiency of 44% is achieved on 1536 CPU cores. (author)

  1. Implementations of BLAST for parallel computers.

    Science.gov (United States)

    Jülich, A

    1995-02-01

    The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.

  2. Programming a massively parallel, computation universal system: static behavior

    Energy Technology Data Exchange (ETDEWEB)

    Lapedes, A.; Farber, R.

    1986-01-01

    In previous work by the authors, the ''optimum finding'' properties of Hopfield neural nets were applied to the nets themselves to create a ''neural compiler.'' This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called ''hidden neurons''), and thus deals with nets capable of universal computation. 22 refs.

  3. Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism,

    Science.gov (United States)

    1985-10-07

    ORGANIZATION NAME AND ADDRESS 10. PROGRAM ELEMENT. PROJECT. TASK* Artificial Inteligence Laboratory AREA Is WORK UNIT NUMBERS 545 Technology Square...D-R162 422 CONCURRENT PROGRMMIZNG USING f"OS XL?ITP TEH l’ LARGE-SCALE PARALLELISH(U) NASI AC E Al CAMBRIDGE ARTIFICIAL INTELLIGENCE L. G AGHA ET AL...RESOLUTION TEST CHART N~ATIONAL BUREAU OF STANDA.RDS - -96 A -E. __ _ __ __’ .,*- - -- •. - MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL

  4. COMPSs-Mobile: parallel programming for mobile-cloud computing

    OpenAIRE

    Lordan Gomis, Francesc-Josep; Badia Sala, Rosa Maria

    2016-01-01

    The advent of Cloud and the popularization of mobile devices have led us to a shift in computing access. Computing users will have an interaction display while the real computation will be performed remotely, in the Cloud. COMPSs-Mobile is a framework that aims to ease the development of energy-efficient and high-performing applications for this environment. The framework provides an infrastructure-unaware programming model that allows developers to code regular Android applications that, ...

  5. Contributions for the optimization of the extensibility of parallel programing of turbulent plasmas

    International Nuclear Information System (INIS)

    Rozar, F.

    2015-01-01

    The work realized through this thesis focuses on the optimization of the Gysela code which simulates a plasma turbulence. Optimization of a scientific application concerns mainly one of the three following points: 1) the simulation of larger meshes, 2) the reduction of computing time and 3) the enhancement of the computation accuracy. The first part of this manuscript presents the contributions relative to the simulation of larger mesh. Alike many simulation codes, getting more realistic simulations is often analogous to rene the meshes. The finer the mesh the larger the memory consumption. Moreover, during these last few years, the supercomputers had trend to provide less and less memory per computer core. For these reasons, we have developed a library, the libMTM (Modeling and Tracing Memory), dedicated to study precisely the memory consumption of parallel softwares. The libMTM tools allowed us to reduce the memory consumption of Gysela and to study its scalability. As far as we know, there is no other tool which provides equivalent features which allow the memory scalability study. The second part of the manuscript presents the works relative to the optimization of the computation time and the improvement of accuracy of the gyro-average operator. This operator represents a corner stone of the gyrokinetic model which is used by the Gysela application. The improvement of accuracy emanates from a change in the computing method: a scheme based on a 2D Hermite interpolation substitutes the Pade approximation. Although the new version of the gyro-average operator is more accurate, it is also more expensive in computation time than the former one. In order to keep the simulation in reasonable time, different optimizations have been performed on the new computing method to get it competitive. Finally, we have developed a MPI parallelized version of the new gyro-average operator. The good scalability of this new gyro-average computer will allow, eventually, a reduction

  6. Compiling the parallel programming language NestStep to the CELL processor

    OpenAIRE

    Holm, Magnus

    2010-01-01

    The goal of this project is to create a source-to-source compiler which will translate NestStep code to C code. The compiler's job is to replace NestStep constructs with a series of function calls to the NestStep runtime system. NestStep is a parallel programming language extension based on the BSP model. It adds constructs for parallel programming on top of an imperative programming language. For this project, only constructs extending the C language are relevant. The output code will compil...

  7. 76 FR 66309 - Pilot Program for Parallel Review of Medical Products; Correction

    Science.gov (United States)

    2011-10-26

    ... DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Medicare and Medicaid Services [CMS-3180-N2] Food and Drug Administration [Docket No. FDA-2010-N-0308] Pilot Program for Parallel Review of Medical... 11, 2011 (76 FR 62808). The document announced a pilot program for sponsors of innovative device...

  8. Episodic and Semantic Memories of a Residential Environmental Education Program

    Science.gov (United States)

    Knapp, Doug; Benton, Gregory M.

    2006-01-01

    This study used a phenomenological approach to investigate the recollections of participants of an environmental education (EE) residential program. Ten students who participated in a residential EE program in the fall of 2001 were interviewed in the fall of 2002. Three major themes relating to the participants' long-term memory of the residential…

  9. Enhancing memory self-efficacy during menopause through a group memory strategies program.

    Science.gov (United States)

    Unkenstein, Anne E; Bei, Bei; Bryant, Christina A

    2017-05-01

    Anxiety about memory during menopause can affect quality of life. We aimed to improve memory self-efficacy during menopause using a group memory strategies program. The program was run five times for a total of 32 peri- and postmenopausal women, age between 47 and 60 years, recruited from hospital menopause and gynecology clinics. The 4-week intervention consisted of weekly 2-hour sessions, and covered how memory works, memory changes related to ageing, health and lifestyle factors, and specific memory strategies. Memory contentment (CT), reported frequency of forgetting (FF), use of memory strategies, psychological distress, and attitude toward menopause were measured. A double-baseline design was applied, with outcomes measured on two baseline occasions (1-month prior [T1] and in the first session [T2]), immediately postintervention (T3), and 3-month postintervention (T4). To describe changes in each variable between time points paired sample t tests were conducted. Mixed-effects models comparing the means of random slopes from T2 to T3 with those from T1 to T2 were conducted for each variable to test for treatment effects. Examination of the naturalistic changes in outcome measures from T1 to T2 revealed no significant changes (all Ps > 0.05). CT, reported FF, and use of memory strategies improved significantly more from T2 to T3, than from T1 to T2 (all Ps attitude toward menopause nor psychological distress improved significantly more postintervention than during the double-baseline (all Ps > 0.05). Improvements in reported CT and FF were maintained after 3 months. The use of group interventions to improve memory self-efficacy during menopause warrants continued evaluation.

  10. Sleep benefits in parallel implicit and explicit measures of episodic memory

    Science.gov (United States)

    Weber, Frederik D.; Wang, Jing-Yi; Born, Jan; Inostroza, Marion

    2014-01-01

    Research in rats using preferences during exploration as a measure of memory has indicated that sleep is important for the consolidation of episodic-like memory, i.e., memory for an event bound into specific spatio-temporal context. How these findings relate to human episodic memory is unclear. We used spontaneous preferences during visual exploration and verbal recall as, respectively, implicit and explicit measures of memory, to study effects of sleep on episodic memory consolidation in humans. During encoding before 10-h retention intervals that covered nighttime sleep or daytime wakefulness, two groups of young adults were presented with two episodes that were 1-h apart. Each episode entailed a spatial configuration of four different faces in a 3 × 3 grid of locations. After the retention interval, implicit spatio-temporal recall performance was assessed by eye-tracking visual exploration of another configuration of four faces of which two were from the first and second episode, respectively; of the two faces one was presented at the same location as during encoding and the other at another location. Afterward explicit verbal recall was assessed. Measures of implicit and explicit episodic memory retention were positively correlated (r = 0.57, P memory recall was associated with increased fast spindles during nonrapid eye movement (NonREM) sleep (r = 0.62, P memory benefits from sleep. PMID:24634354

  11. Repeated Stimulation of Cultured Networks of Rat Cortical Neurons Induces Parallel Memory Traces

    Science.gov (United States)

    le Feber, Joost; Witteveen, Tim; van Veenendaal, Tamar M.; Dijkstra, Jelle

    2015-01-01

    During systems consolidation, memories are spontaneously replayed favoring information transfer from hippocampus to neocortex. However, at present no empirically supported mechanism to accomplish a transfer of memory from hippocampal to extra-hippocampal sites has been offered. We used cultured neuronal networks on multielectrode arrays and…

  12. Concurrent Collections (CnC): A new approach to parallel programming

    CERN Multimedia

    CERN. Geneva

    2010-01-01

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer o...

  13. Optical interconnection network for parallel access to multi-rank memory in future computing systems.

    Science.gov (United States)

    Wang, Kang; Gu, Huaxi; Yang, Yintang; Wang, Kun

    2015-08-10

    With the number of cores increasing, there is an emerging need for a high-bandwidth low-latency interconnection network, serving core-to-memory communication. In this paper, aiming at the goal of simultaneous access to multi-rank memory, we propose an optical interconnection network for core-to-memory communication. In the proposed network, the wavelength usage is delicately arranged so that cores can communicate with different ranks at the same time and broadcast for flow control can be achieved. A distributed memory controller architecture that works in a pipeline mode is also designed for efficient optical communication and transaction address processes. The scaling method and wavelength assignment for the proposed network are investigated. Compared with traditional electronic bus-based core-to-memory communication, the simulation results based on the PARSEC benchmark show that the bandwidth enhancement and latency reduction are apparent.

  14. A multithreaded parallel implementation of a dynamic programming algorithm for sequence comparison.

    Science.gov (United States)

    Martins, W S; Del Cuvillo, J B; Useche, F J; Theobald, K B; Gao, G R

    2001-01-01

    This paper discusses the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform based on a fine-grain event-driven multithreaded program execution model. Fine-grain multithreading permits efficient parallelism exploitation in this application both by taking advantage of asynchronous point-to-point synchronizations and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. We have implemented our scheme on EARTH, a fine-grain event-driven multithreaded execution and architecture model which has been ported to a number of parallel machines with off-the-shelf processors. Our experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance (e.g., speedup of 90 on 120 nodes), good programmability and reasonable cost.

  15. Injecting Artificial Memory Errors Into a Running Computer Program

    Science.gov (United States)

    Bornstein, Benjamin J.; Granat, Robert A.; Wagstaff, Kiri L.

    2008-01-01

    Single-event upsets (SEUs) or bitflips are computer memory errors caused by radiation. BITFLIPS (Basic Instrumentation Tool for Fault Localized Injection of Probabilistic SEUs) is a computer program that deliberately injects SEUs into another computer program, while the latter is running, for the purpose of evaluating the fault tolerance of that program. BITFLIPS was written as a plug-in extension of the open-source Valgrind debugging and profiling software. BITFLIPS can inject SEUs into any program that can be run on the Linux operating system, without needing to modify the program s source code. Further, if access to the original program source code is available, BITFLIPS offers fine-grained control over exactly when and which areas of memory (as specified via program variables) will be subjected to SEUs. The rate of injection of SEUs is controlled by specifying either a fault probability or a fault rate based on memory size and radiation exposure time, in units of SEUs per byte per second. BITFLIPS can also log each SEU that it injects and, if program source code is available, report the magnitude of effect of the SEU on a floating-point value or other program variable.

  16. Parallel Programming Application to Matrix Algebra in the Spectral Method for Control Systems Analysis, Synthesis and Identification

    Directory of Open Access Journals (Sweden)

    V. Yu. Kleshnin

    2016-01-01

    Full Text Available The article describes the matrix algebra libraries based on the modern technologies of parallel programming for the Spectrum software, which can use a spectral method (in the spectral form of mathematical description to analyse, synthesise and identify deterministic and stochastic dynamical systems. The developed matrix algebra libraries use the following technologies for the GPUs: OmniThreadLibrary, OpenMP, Intel Threading Building Blocks, Intel Cilk Plus for CPUs nVidia CUDA, OpenCL, and Microsoft Accelerated Massive Parallelism.The developed libraries support matrices with real elements (single and double precision. The matrix dimensions are limited by 32-bit or 64-bit memory model and computer configuration. These libraries are general-purpose and can be used not only for the Spectrum software. They can also find application in the other projects where there is a need to perform operations with large matrices.The article provides a comparative analysis of the libraries developed for various matrix operations (addition, subtraction, scalar multiplication, multiplication, powers of matrices, tensor multiplication, transpose, inverse matrix, finding a solution of the system of linear equations through the numerical experiments using different CPU and GPU. The article contains sample programs and performance test results for matrix multiplication, which requires most of all computational resources in regard to the other operations.

  17. High performance parallelism pearls 2 multicore and many-core programming approaches

    CERN Document Server

    Jeffers, Jim

    2015-01-01

    High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t

  18. A highly efficient parallel algorithm for solving the neutron diffusion nodal equations on shared-memory computers

    International Nuclear Information System (INIS)

    Azmy, Y.Y.; Kirk, B.L.

    1990-01-01

    Modern parallel computer architectures offer an enormous potential for reducing CPU and wall-clock execution times of large-scale computations commonly performed in various applications in science and engineering. Recently, several authors have reported their efforts in developing and implementing parallel algorithms for solving the neutron diffusion equation on a variety of shared- and distributed-memory parallel computers. Testing of these algorithms for a variety of two- and three-dimensional meshes showed significant speedup of the computation. Even for very large problems (i.e., three-dimensional fine meshes) executed concurrently on a few nodes in serial (nonvector) mode, however, the measured computational efficiency is very low (40 to 86%). In this paper, the authors present a highly efficient (∼85 to 99.9%) algorithm for solving the two-dimensional nodal diffusion equations on the Sequent Balance 8000 parallel computer. Also presented is a model for the performance, represented by the efficiency, as a function of problem size and the number of participating processors. The model is validated through several tests and then extrapolated to larger problems and more processors to predict the performance of the algorithm in more computationally demanding situations

  19. Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10.

    Science.gov (United States)

    Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P

    2014-10-30

    Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work stealing runtime, and report performance results for long-range HF energy calculation of large molecule/high quality basis running on up to 1024 cores of a high performance cluster machine. Copyright © 2014 Wiley Periodicals, Inc.

  20. High performance parallel computers for science: New developments at the Fermilab advanced computer program

    International Nuclear Information System (INIS)

    Nash, T.; Areti, H.; Atac, R.

    1988-08-01

    Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor, has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs

  1. Empirical valence bond models for reactive potential energy surfaces: a parallel multilevel genetic program approach.

    Science.gov (United States)

    Bellucci, Michael A; Coker, David F

    2011-07-28

    We describe a new method for constructing empirical valence bond potential energy surfaces using a parallel multilevel genetic program (PMLGP). Genetic programs can be used to perform an efficient search through function space and parameter space to find the best functions and sets of parameters that fit energies obtained by ab initio electronic structure calculations. Building on the traditional genetic program approach, the PMLGP utilizes a hierarchy of genetic programming on two different levels. The lower level genetic programs are used to optimize coevolving populations in parallel while the higher level genetic program (HLGP) is used to optimize the genetic operator probabilities of the lower level genetic programs. The HLGP allows the algorithm to dynamically learn the mutation or combination of mutations that most effectively increase the fitness of the populations, causing a significant increase in the algorithm's accuracy and efficiency. The algorithm's accuracy and efficiency is tested against a standard parallel genetic program with a variety of one-dimensional test cases. Subsequently, the PMLGP is utilized to obtain an accurate empirical valence bond model for proton transfer in 3-hydroxy-gamma-pyrone in gas phase and protic solvent. © 2011 American Institute of Physics

  2. Combined spatial/angular domain decomposition SN algorithms for shared memory parallel machines

    International Nuclear Information System (INIS)

    Hunter, M.A.; Haghighat, A.

    1993-01-01

    Several parallel processing algorithms on the basis of spatial and angular domain decomposition methods are developed and incorporated into a two-dimensional discrete ordinates transport theory code. These algorithms divide the spatial and angular domains into independent subdomains so that the flux calculations within each subdomain can be processed simultaneously. Two spatial parallel algorithms (Block-Jacobi, red-black), one angular parallel algorithm (η-level), and their combinations are implemented on an eight processor CRAY Y-MP. Parallel performances of the algorithms are measured using a series of fixed source RZ geometry problems. Some of the results are also compared with those executed on an IBM 3090/600J machine. (orig.)

  3. Reversal of Cocaine-Associated Synaptic Plasticity in Medial Prefrontal Cortex Parallels Elimination of Memory Retrieval.

    Science.gov (United States)

    Otis, James M; Mueller, Devin

    2017-09-01

    Addiction is characterized by abnormalities in prefrontal cortex that are thought to allow drug-associated cues to drive compulsive drug seeking and taking. Identification and reversal of these pathologic neuroadaptations are therefore critical for treatment of addiction. Previous studies using rodents reveal that drugs of abuse cause dendritic spine plasticity in prelimbic medial prefrontal cortex (PL-mPFC) pyramidal neurons, a phenomenon that correlates with the strength of drug-associated memories in vivo. Thus, we hypothesized that cocaine-evoked plasticity in PL-mPFC may underlie cocaine-associated memory retrieval, and therefore disruption of this plasticity would prevent retrieval. Indeed, using patch clamp electrophysiology we find that cocaine place conditioning increases excitatory presynaptic and postsynaptic transmission in rat PL-mPFC pyramidal neurons. This was accounted for by increases in excitatory presynaptic release, paired-pulse facilitation, and increased AMPA receptor transmission. Noradrenergic signaling is known to maintain glutamatergic plasticity upon reactivation of modified circuits, and we therefore next determined whether inhibition of noradrenergic signaling during memory reactivation would reverse the cocaine-evoked plasticity and/or disrupt the cocaine-associated memory. We find that administration of the β-adrenergic receptor antagonist propranolol before memory retrieval, but not after (during memory reconsolidation), reverses the cocaine-evoked presynaptic and postsynaptic modifications in PL-mPFC and causes long-lasting memory impairments. Taken together, these data reveal that cocaine-evoked synaptic plasticity in PL-mPFC is reversible in vivo, and suggest a novel strategy that would allow normalization of prefrontal circuitry in addiction.

  4. The energy band memory server algorithm for parallel Monte Carlo transport calculations

    International Nuclear Information System (INIS)

    Felker, K.G.; Siegel, A.R.; Smith, K.S.; Romano, P.K.; Forget, B.

    2013-01-01

    An algorithm is developed to significantly reduce the on-node footprint of cross section memory in Monte Carlo particle tracking algorithms. The classic method of per-node replication of cross section data is replaced by a memory server model, in which the read-only lookup tables reside on a remote set of disjoint processors. The main particle tracking algorithm is then modified in such a way as to enable efficient use of the remotely stored data in the particle tracking algorithm. Results of a prototype code on a Blue Gene/Q installation reveal that the penalty for remote storage is reasonable in the context of time scales for real-world applications, thus yielding a path forward for a broad range of applications that are memory bound using current techniques. (authors)

  5. Towards scalable parallelism in Monte Carlo particle transport codes using remote memory access

    International Nuclear Information System (INIS)

    Romano, Paul K.; Forget, Benoit; Brown, Forrest

    2010-01-01

    One forthcoming challenge in the area of high-performance computing is having the ability to run large-scale problems while coping with less memory per compute node. In this work, we investigate a novel data decomposition method that would allow Monte Carlo transport calculations to be performed on systems with limited memory per compute node. In this method, each compute node remotely retrieves a small set of geometry and cross-section data as needed and remotely accumulates local tallies when crossing the boundary of the local spatial domain. Initial results demonstrate that while the method does allow large problems to be run in a memory-limited environment, achieving scalability may be difficult due to inefficiencies in the current implementation of RMA operations. (author)

  6. Parallel Stories, Differentiated Histories. Exploring Fiction and Memory in Spanish and Portuguese Television

    Directory of Open Access Journals (Sweden)

    José Carlos Rueda Laffond

    2013-06-01

    Full Text Available Integrated into an international project on the characteristics of historical fiction on TV in Spain and Portugal during 2001–2012, the study traces the main aspects of these productions as entertainment products and memory strategies. Historical fiction on Iberian television channels express qualitative problems of interpretation. Its development must be related to issues such practices, meanings and forms of recognition, and connected with specific memory systems. The article explores a set of key–points: uses and topics of historical fiction; its visions through similarly proposals; polarization in several historical times, and its convergent perspectives about Franco and Oliveira Salazar as Iberian contemporary dictators.

  7. Three-dimensional magnetic field computation on a distributed memory parallel processor

    International Nuclear Information System (INIS)

    Barion, M.L.

    1990-01-01

    The analysis of three-dimensional magnetic fields by finite element methods frequently proves too onerous a task for the computing resource on which it is attempted. When non-linear and transient effects are included, it may become impossible to calculate the field distribution to sufficient resolution. One approach to this problem is to exploit the natural parallelism in the finite element method via parallel processing. This paper reports on an implementation of a finite element code for non-linear three-dimensional low-frequency magnetic field calculation on Intel's iPSC/2

  8. Generalized Analytical Program of Thyristor Phase Control Circuit with Series and Parallel Resonance Load

    OpenAIRE

    Nakanishi, Sen-ichiro; Ishida, Hideaki; Himei, Toyoji

    1981-01-01

    The systematic analytical method is reqUired for the ac phase control circuit by means of an inverse parallel thyristor pair which has a series and parallel L-C resonant load, because the phase control action causes abnormal and interesting phenomena, such as an extreme increase of voltage and current, an unique increase and decrease of contained higher harmonics, and a wide variation of power factor, etc. In this paper, the program for the analysis of the thyristor phase control circuit with...

  9. Serine/threonine-protein phosphatase 1 α levels are paralleling olfactory memory formation in the CD1 mouse.

    Science.gov (United States)

    Winding, Christiana; Sun, Yanwei; Höger, Harald; Bubna-Littitz, Hermann; Pollak, Arnold; Schmidt, Peter; Lubec, Gert

    2011-06-01

    Although olfactory discrimination has already been studied in several mouse strains, data on protein levels linked to olfactory memory are limited. Wild mouse strains Mus musculus musculus, Mus musculus domesticus and CD1 laboratory outbred mice were tested in a conditioned odor preference task and trained to discriminate between two odors, Rose and Lemon, by pairing one odor with a sugar reward. Six hours following the final test, mice were sacrificed and olfactory bulbs (OB) were taken for gel-based proteomics analyses and immunoblotting. OB proteins were extracted, separated by 2-DE and quantified using specific software (Proteomweaver). Odor-trained mice showed a preference for the previously rewarded odor suggesting that conditioned odor preference occurred. In CD1 mice levels, one out of 482 protein spots was significantly increased in odor-trained mice as compared with the control group; it was in-gel digested by trypsin and chymotrypsin and analyzed by tandem mass spectrometry (nano-ESI-LC-MS/MS). The spot was unambiguously identified as serine/threonine-protein phosphatase PP1-α catalytic subunit (PP-1A) and differential levels observed in gel-based proteomic studies were verified by immunoblotting. PP-1A is a key signalling element in synaptic plasticity and memory processes and is herein shown to be paralleling olfactory discrimination representing olfactory memory. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  10. Teaching Scientific Computing: A Model-Centered Approach to Pipeline and Parallel Programming with C

    Directory of Open Access Journals (Sweden)

    Vladimiras Dolgopolovas

    2015-01-01

    Full Text Available The aim of this study is to present an approach to the introduction into pipeline and parallel computing, using a model of the multiphase queueing system. Pipeline computing, including software pipelines, is among the key concepts in modern computing and electronics engineering. The modern computer science and engineering education requires a comprehensive curriculum, so the introduction to pipeline and parallel computing is the essential topic to be included in the curriculum. At the same time, the topic is among the most motivating tasks due to the comprehensive multidisciplinary and technical requirements. To enhance the educational process, the paper proposes a novel model-centered framework and develops the relevant learning objects. It allows implementing an educational platform of constructivist learning process, thus enabling learners’ experimentation with the provided programming models, obtaining learners’ competences of the modern scientific research and computational thinking, and capturing the relevant technical knowledge. It also provides an integral platform that allows a simultaneous and comparative introduction to pipelining and parallel computing. The programming language C for developing programming models and message passing interface (MPI and OpenMP parallelization tools have been chosen for implementation.

  11. Parallel low-memory quasi-Newton optimization algorithm for molecular structure

    Czech Academy of Sciences Publication Activity Database

    Klemsa, Jakub; Řezáč, Jan

    2013-01-01

    Roč. 584, Oct 1 (2013), s. 10-13 ISSN 0009-2614 R&D Projects: GA ČR GP13-01214P Institutional support: RVO:61388963 Keywords : geometry optimization * parallelization * molecular graph Subject RIV: CF - Physical ; Theoretical Chemistry Impact factor: 1.991, year: 2013

  12. Hardware-Oblivious Parallelism for In-Memory Column-Stores

    NARCIS (Netherlands)

    M. Heimel; M. Saecker; H. Pirk (Holger); S. Manegold (Stefan); V. Markl

    2013-01-01

    htmlabstractThe multi-core architectures of today’s computer systems make parallelism a necessity for performance critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: Current database systems thus rely on labor-intensive and error-prone

  13. A Hybrid Shared-Memory Parallel Max-Tree Algorithm for Extreme Dynamic-Range Images.

    Science.gov (United States)

    Moschini, Ugo; Meijster, Arnold; Wilkinson, Michael H F

    2018-03-01

    Max-trees, or component trees, are graph structures that represent the connected components of an image in a hierarchical way. Nowadays, many application fields rely on images with high-dynamic range or floating point values. Efficient sequential algorithms exist to build trees and compute attributes for images of any bit depth. However, we show that the current parallel algorithms perform poorly already with integers at bit depths higher than 16 bits per pixel. We propose a parallel method combining the two worlds of flooding and merging max-tree algorithms. First, a pilot max-tree of a quantized version of the image is built in parallel using a flooding method. Later, this structure is used in a parallel leaf-to-root approach to compute efficiently the final max-tree and to drive the merging of the sub-trees computed by the threads. We present an analysis of the performance both on simulated and actual 2D images and 3D volumes. Execution times are about better than the fastest sequential algorithm and speed-up goes up to on 64 threads.

  14. Academic training: From Evolution Theory to Parallel and Distributed Genetic Programming

    CERN Multimedia

    2007-01-01

    2006-2007 ACADEMIC TRAINING PROGRAMME LECTURE SERIES 15, 16 March From 11:00 to 12:00 - Main Auditorium, bldg. 500 From Evolution Theory to Parallel and Distributed Genetic Programming F. FERNANDEZ DE VEGA / Univ. of Extremadura, SP Lecture No. 1: From Evolution Theory to Evolutionary Computation Evolutionary computation is a subfield of artificial intelligence (more particularly computational intelligence) involving combinatorial optimization problems, which are based to some degree on the evolution of biological life in the natural world. In this tutorial we will review the source of inspiration for this metaheuristic and its capability for solving problems. We will show the main flavours within the field, and different problems that have been successfully solved employing this kind of techniques. Lecture No. 2: Parallel and Distributed Genetic Programming The successful application of Genetic Programming (GP, one of the available Evolutionary Algorithms) to optimization problems has encouraged an ...

  15. Comparison of systems for memory allocation in the C programming language

    OpenAIRE

    Zavrtanik, Matej

    2016-01-01

    The bachelor thesis describes memory allocation. Work begins with description of mechanism, system calls and data structures used in memory allocators. Goals of memory allocation ares listed along with problems which must be avoided. Afterwards construction and allocating of popular memory allocators is described. Work ends with comparison of memory allocators based on time of execution of programs and memory usage, on which conclusion is based.

  16. Algorithmic differentiation of pragma-defined parallel regions differentiating computer programs containing OpenMP

    CERN Document Server

    Förster, Michael

    2014-01-01

    Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficient as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values simultaneously but also the derivative values, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh

  17. Parallel interactive retrieval of item and associative information from event memory.

    Science.gov (United States)

    Cox, Gregory E; Criss, Amy H

    2017-09-01

    Memory contains information about individual events (items) and combinations of events (associations). Despite the fundamental importance of this distinction, it remains unclear exactly how these two kinds of information are stored and whether different processes are used to retrieve them. We use both model-independent qualitative properties of response dynamics and quantitative modeling of individuals to address these issues. Item and associative information are not independent and they are retrieved concurrently via interacting processes. During retrieval, matching item and associative information mutually facilitate one another to yield an amplified holistic signal. Modeling of individuals suggests that this kind of facilitation between item and associative retrieval is a ubiquitous feature of human memory. Copyright © 2017 Elsevier Inc. All rights reserved.

  18. Run-Time and Compiler Support for Programming in Adaptive Parallel Environments

    Directory of Open Access Journals (Sweden)

    Guy Edjlali

    1997-01-01

    Full Text Available For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of high-performance Fortran (HPF-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.

  19. P3T+: A Performance Estimator for Distributed and Parallel Programs

    Directory of Open Access Journals (Sweden)

    T. Fahringer

    2000-01-01

    Full Text Available Developing distributed and parallel programs on today's multiprocessor architectures is still a challenging task. Particular distressing is the lack of effective performance tools that support the programmer in evaluating changes in code, problem and machine sizes, and target architectures. In this paper we introduce P3T+ which is a performance estimator for mostly regular HPF (High Performance Fortran programs but partially covers also message passing programs (MPI. P3T+ is unique by modeling programs, compiler code transformations, and parallel and distributed architectures. It computes at compile-time a variety of performance parameters including work distribution, number of transfers, amount of data transferred, transfer times, computation times, and number of cache misses. Several novel technologies are employed to compute these parameters: loop iteration spaces, array access patterns, and data distributions are modeled by employing highly effective symbolic analysis. Communication is estimated by simulating the behavior of a communication library used by the underlying compiler. Computation times are predicted through pre-measured kernels on every target architecture of interest. We carefully model most critical architecture specific factors such as cache lines sizes, number of cache lines available, startup times, message transfer time per byte, etc. P3T+ has been implemented and is closely integrated with the Vienna High Performance Compiler (VFC to support programmers develop parallel and distributed applications. Experimental results for realistic kernel codes taken from real-world applications are presented to demonstrate both accuracy and usefulness of P3T+.

  20. Methodologies and Tools for Tuning Parallel Programs: 80% Art, 20% Science, and 10% Luck

    Science.gov (United States)

    Yan, Jerry C.; Bailey, David (Technical Monitor)

    1996-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. In the past few years, the ubiquitous introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CRI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance instrumentation/monitor/tuning technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g. AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  1. Waking Up Buried Memories of Old TV Programs.

    Science.gov (United States)

    Larzabal, Christelle; Bacon-Macé, Nadège; Muratot, Sophie; Thorpe, Simon J

    2017-01-01

    Although it has been demonstrated that visual and auditory stimuli can be recalled decades after the initial exposure, previous studies have generally not ruled out the possibility that the material may have been seen or heard during the intervening period. Evidence shows that reactivations of a long-term memory trace play a role in its update and maintenance. In the case of remote or very long-term memories, it is most likely that these reactivations are triggered by the actual re-exposure to the stimulus. In this study we decided to explore whether it is possible to recall stimuli that could not have been re-experienced in the intervening period. We tested the ability of French participants ( N = 34, 31 female) to recall 50 TV programs broadcast on average for the last time 44 years ago (from the 60's and early 70's). Potential recall was elicited by the presentation of short audiovisual excerpts of these TV programs. The absence of potential re-exposure to the material was strictly controlled by selecting TV programs that have never been rebroadcast and were not available in the public domain. Our results show that six TV programs were particularly well identified on average across the 34 participants with a median percentage of 71.7% ( SD = 13.6, range: 48.5-87.9%). We also obtained 50 single case reports with associated information about the viewing of 23 TV programs including the 6 previous ones. More strikingly, for two cases, retrieval of the title was made spontaneously without the need of a four-proposition choice. These results suggest that re-exposures to the stimuli are not necessary to maintain a memory for a lifetime. These new findings raise fundamental questions about the underlying mechanisms used by the brain to store these very old sensory memories.

  2. Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

    KAUST Repository

    Quintin, Jean-Noel

    2013-10-01

    Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon\\'s algorithm which dates back to 1969 was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon\\'s algorithm as it can be used on a nonsquare number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude making the contribution of communication in the overall execution time more significant. Therefore, the state of the art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene/P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. © 2013 IEEE.

  3. Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

    KAUST Repository

    Quintin, Jean-Noel; Hasanov, Khalid; Lastovetsky, Alexey

    2013-01-01

    Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm which dates back to 1969 was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon's algorithm as it can be used on a nonsquare number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude making the contribution of communication in the overall execution time more significant. Therefore, the state of the art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene/P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. © 2013 IEEE.

  4. Enabling Requirements-Based Programming for Highly-Dependable Complex Parallel and Distributed Systems

    Science.gov (United States)

    Hinchey, Michael G.; Rash, James L.; Rouff, Christopher A.

    2005-01-01

    The manual application of formal methods in system specification has produced successes, but in the end, despite any claims and assertions by practitioners, there is no provable relationship between a manually derived system specification or formal model and the customer's original requirements. Complex parallel and distributed system present the worst case implications for today s dearth of viable approaches for achieving system dependability. No avenue other than formal methods constitutes a serious contender for resolving the problem, and so recognition of requirements-based programming has come at a critical juncture. We describe a new, NASA-developed automated requirement-based programming method that can be applied to certain classes of systems, including complex parallel and distributed systems, to achieve a high degree of dependability.

  5. Dynamic programming in parallel boundary detection with application to ultrasound intima-media segmentation.

    Science.gov (United States)

    Zhou, Yuan; Cheng, Xinyao; Xu, Xiangyang; Song, Enmin

    2013-12-01

    Segmentation of carotid artery intima-media in longitudinal ultrasound images for measuring its thickness to predict cardiovascular diseases can be simplified as detecting two nearly parallel boundaries within a certain distance range, when plaque with irregular shapes is not considered. In this paper, we improve the implementation of two dynamic programming (DP) based approaches to parallel boundary detection, dual dynamic programming (DDP) and piecewise linear dual dynamic programming (PL-DDP). Then, a novel DP based approach, dual line detection (DLD), which translates the original 2-D curve position to a 4-D parameter space representing two line segments in a local image segment, is proposed to solve the problem while maintaining efficiency and rotation invariance. To apply the DLD to ultrasound intima-media segmentation, it is imbedded in a framework that employs an edge map obtained from multiplication of the responses of two edge detectors with different scales and a coupled snake model that simultaneously deforms the two contours for maintaining parallelism. The experimental results on synthetic images and carotid arteries of clinical ultrasound images indicate improved performance of the proposed DLD compared to DDP and PL-DDP, with respect to accuracy and efficiency. Copyright © 2013 Elsevier B.V. All rights reserved.

  6. A backtracking algorithm for the stream AND-parallel execution of logic programs

    Energy Technology Data Exchange (ETDEWEB)

    Somogyi, Z.; Ramamohanarao, K.; Vaghani, J. (Univ. of Melbourne, Parkville (Australia))

    1988-06-01

    The authors present the first backtracking algorithm for stream AND-parallel logic programs. It relies on compile-time knowledge of the data flow graph of each clause to let it figure out efficiently which goals to kill or restart when a goal fails. This crucial information, which they derive from mode declarations, was not available at compile-time in any previous stream AND-parallel system. They show that modes can increase the precision of the backtracking algorithm, though their algorithm allows this precision to be traded off against overhead on a procedure-by-procedure and call-by-call basis. The modes also allow their algorithm to handle efficiently programs that manipulate partially instantiated data structures and an important class of programs with circular dependency graphs. On code that does not need backtracking, the efficiency of their algorithm approaches that of the committed-choice languages; on code that does need backtracking its overhead is comparable to that of the independent AND-parallel backtracking algorithms.

  7. Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    Energy Technology Data Exchange (ETDEWEB)

    Barbara Chapman

    2012-02-01

    OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.

  8. Frontal cortex and hippocampus neurotransmitter receptor complex level parallels spatial memory performance in the radial arm maze.

    Science.gov (United States)

    Shanmugasundaram, Bharanidharan; Sase, Ajinkya; Miklosi, András G; Sialana, Fernando J; Subramaniyan, Saraswathi; Aher, Yogesh D; Gröger, Marion; Höger, Harald; Bennett, Keiryn L; Lubec, Gert

    2015-08-01

    Several neurotransmitter receptors have been proposed to be involved in memory formation. However, information on receptor complexes (RCs) in the radial arm maze (RAM) is missing. It was therefore the aim of this study to determine major neurotransmitter RCs levels that are modulated by RAM training because receptors are known to work in homo-or heteromeric assemblies. Immediate early gene Arc expression was determined by immunohistochemistry to show if prefrontal cortices (PFC) and hippocampi were activated following RAM training as these regions are known to be mainly implicated in spatial memory. Twelve rats per group, trained and untrained in the twelve arm RAM were used, frontal cortices and hippocampi were taken, RCs in membrane protein were quantified by blue-native PAGE immunoblotting. RCs components were characterised by co-immunoprecipitation followed by mass spectrometrical analysis and by the use of the proximity ligation assay. Arc expression was significantly higher in PFC of trained as compared to untrained rats whereas it was comparable in hippocampi. Frontal cortical levels of RCs containing AMPA receptors GluA1, GluA2, NMDA receptors GluN1 and GluN2A, dopamine receptor D1, acetylcholine nicotinic receptor alpha 7 (nAChR-α7) and hippocampal levels of RCs containing D1, GluN1, GluN2B and nAChR-α7 were increased in the trained group; phosphorylated dopamine transporter levels were decreased in the trained group. D1 and GluN1 receptors were shown to be in the same complex. Taken together, distinct RCs were paralleling performance in the RAM which is relevant for interpretation of previous and design of future work on RCs in memory studies. Copyright © 2015 Elsevier B.V. All rights reserved.

  9. Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.

    Science.gov (United States)

    González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J

    2017-02-01

    The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  10. Development, Verification and Validation of Parallel, Scalable Volume of Fluid CFD Program for Propulsion Applications

    Science.gov (United States)

    West, Jeff; Yang, H. Q.

    2014-01-01

    There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.

  11. The FORCE: A portable parallel programming language supporting computational structural mechanics

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna

    1989-01-01

    This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is a part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be imnplemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.

  12. Parallel rendering

    Science.gov (United States)

    Crockett, Thomas W.

    1995-01-01

    This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.

  13. Shared Variable Oriented Parallel Precompiler for SPMD Model

    Institute of Scientific and Technical Information of China (English)

    1995-01-01

    For the moment,commercial parallel computer systems with distributed memory architecture are usually provided with parallel FORTRAN or parallel C compliers,which are just traditional sequential FORTRAN or C compilers expanded with communication statements.Programmers suffer from writing parallel programs with communication statements. The Shared Variable Oriented Parallel Precompiler (SVOPP) proposed in this paper can automatically generate appropriate communication statements based on shared variables for SPMD(Single Program Multiple Data) computation model and greatly ease the parallel programming with high communication efficiency.The core function of parallel C precompiler has been successfully verified on a transputer-based parallel computer.Its prominent performance shows that SVOPP is probably a break-through in parallel programming technique.

  14. Compiler Technology for Parallel Scientific Computation

    Directory of Open Access Journals (Sweden)

    Can Özturan

    1994-01-01

    Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.

  15. Two Parallel Pathways Assign Opposing Odor Valences during Drosophila Memory Formation

    Directory of Open Access Journals (Sweden)

    Daisuke Yamazaki

    2018-02-01

    Full Text Available During olfactory associative learning in Drosophila, odors activate specific subsets of intrinsic mushroom body (MB neurons. Coincident exposure to either rewards or punishments is thought to activate extrinsic dopaminergic neurons, which modulate synaptic connections between odor-encoding MB neurons and MB output neurons to alter behaviors. However, here we identify two classes of intrinsic MB γ neurons based on cAMP response element (CRE-dependent expression, γCRE-p and γCRE-n, which encode aversive and appetitive valences. γCRE-p and γCRE-n neurons act antagonistically to maintain neutral valences for neutral odors. Activation or inhibition of either cell type upsets this balance, toggling odor preferences to either positive or negative values. The mushroom body output neurons, MBON-γ5β′2a/β′2mp and MBON-γ2α′1, mediate the actions of γCRE-p and γCRE-n neurons. Our data indicate that MB neurons encode valence information, as well as odor information, and this information is integrated through a process involving MBONs to regulate learning and memory.

  16. The parallel impact of episodic memory and episodic future thinking on food intake.

    Science.gov (United States)

    Vartanian, Lenny R; Chen, William H; Reily, Natalie M; Castel, Alan D

    2016-06-01

    This research examined the effects of both episodic memory and episodic future thinking (EFT) on snack food intake. In Study 1, female participants (n = 158) were asked to recall their lunch from earlier in the day, to think about the dinner they planned to have later in the day, or to think about a non-food activity before taking part in a cookie taste test. Participants who recalled their lunch or who thought about their dinner ate less than did participants who thought about non-food activities. These effects were not explained by group differences in the hedonic value of the food. Study 2 examined whether the suppression effect observed in Study 1 was driven by a general health consciousness. Female participants (n = 74) were asked to think about their past or future exercise (or a non-exercise activity), but thinking about exercise had no impact on participants' cookie consumption. Overall, both thinking about past food intake and imagining future food intake had the same suppression effect on participants' current food intake, but further research is needed to determine the underlying mechanism. Copyright © 2016 Elsevier Ltd. All rights reserved.

  17. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  18. ANDRA's Long-Term Memory-Preservation Program

    International Nuclear Information System (INIS)

    Charton, Patrick; Dumont, Jean-Noel

    2012-01-01

    Maintaining the memory of repositories over the long term is required not only to ensure safety and reversibility (legal requirements), but also in response to social expectations. Since 2010, ANDRA has been implementing a long-term memory preservation program to reinforce and diversify its current arrangements in that field, as well as explore opportunities to extend memory keeping over thousands of years. As a reference solution, ANDRA uses the arrangement and practices in place at its surface disposal facilities. Although at ANDRA they are well aware of the fact that the operational time of a LILW disposal corresponds more or less with the implementation time of a geological disposal, useful insights can nevertheless be gained from the experience with short-lived LILW. The reference solution implemented at the Centre de la Manche and de l'Aube includes five memorization devises, differentiating between passive and active memory keeping. There are three passive memories, consisting of copies with various degrees of information on permanent paper, and two active memories, consisting of oral transmissions under the form of events and meetings with various (local) stakeholders. For HLW, ANDRA calculates a timescale of 200 years for the total implementation of a geological disposal, and a minimum of memory of 500 years afterwards, which necessitates conservation of RK and M for 700 years in total. Permanent paper lasts 600-1000 years. ANDRA also developed a sapphire disk which can contain large amounts of records and endure 1 million years. In fact this device created more questions; the purpose is exactly to question 'solutions' that are solely based on engineering. It for instance evokes questions such as 'Which languages should we use, which graphical material should we add, how can we avoid vandalism, what meaning will future generations give to the traces we leave?' All these issues and more are being investigated under ANDRA's long term memory preservation

  19. Programs for Testing Processor-in-Memory Computing Systems

    Science.gov (United States)

    Katz, Daniel S.

    2006-01-01

    The Multithreaded Microbenchmarks for Processor-In-Memory (PIM) Compilers, Simulators, and Hardware are computer programs arranged in a series for use in testing the performances of PIM computing systems, including compilers, simulators, and hardware. The programs at the beginning of the series test basic functionality; the programs at subsequent positions in the series test increasingly complex functionality. The programs are intended to be used while designing a PIM system, and can be used to verify that compilers, simulators, and hardware work correctly. The programs can also be used to enable designers of these system components to examine tradeoffs in implementation. Finally, these programs can be run on non-PIM hardware (either single-threaded or multithreaded) using the POSIX pthreads standard to verify that the benchmarks themselves operate correctly. [POSIX (Portable Operating System Interface for UNIX) is a set of standards that define how programs and operating systems interact with each other. pthreads is a library of pre-emptive thread routines that comply with one of the POSIX standards.

  20. Parallel prefrontal pathways reach distinct excitatory and inhibitory systems in memory-related rhinal cortices.

    Science.gov (United States)

    Bunce, Jamie G; Zikopoulos, Basilis; Feinberg, Marcia; Barbas, Helen

    2013-12-15

    To investigate how prefrontal cortices impinge on medial temporal cortices we labeled pathways from the anterior cingulate cortex (ACC) and posterior orbitofrontal cortex (pOFC) in rhesus monkeys to compare their relationship with excitatory and inhibitory systems in rhinal cortices. The ACC pathway terminated mostly in areas 28 and 35 with a high proportion of large terminals, whereas the pOFC pathway terminated mostly through small terminals in area 36 and sparsely in areas 28 and 35. Both pathways terminated in all layers. Simultaneous labeling of pathways and distinct neurochemical classes of inhibitory neurons, followed by analyses of appositions of presynaptic and postsynaptic fluorescent signal, or synapses, showed overall predominant association with spines of putative excitatory neurons, but also significant interactions with presumed inhibitory neurons labeled for calretinin, calbindin, or parvalbumin. In the upper layers of areas 28 and 35 the ACC pathway was associated with dendrites of neurons labeled with calretinin, which are thought to disinhibit neighboring excitatory neurons, suggesting facilitated hippocampal access. In contrast, in area 36 pOFC axons were associated with dendrites of calbindin neurons, which are poised to reduce noise and enhance signal. In the deep layers, both pathways innervated mostly dendrites of parvalbumin neurons, which strongly inhibit neighboring excitatory neurons, suggesting gating of hippocampal output to other cortices. These findings suggest that the ACC, associated with attention and context, and the pOFC, associated with emotional valuation, have distinct contributions to memory in rhinal cortices, in processes that are disrupted in psychiatric diseases. Copyright © 2013 Wiley Periodicals, Inc.

  1. Speeding Up the String Comparison of the IDS Snort using Parallel Programming: A Systematic Literature Review on the Parallelized Aho-Corasick Algorithm

    Directory of Open Access Journals (Sweden)

    SILVA JUNIOR,J. B.

    2016-12-01

    Full Text Available The Intrusion Detection System (IDS needs to compare the contents of all packets arriving at the network interface with a set of signatures for indicating possible attacks, a task that consumes much CPU processing time. In order to alleviate this problem, some researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to GPU. This paper identifies and maps the parallelization features of the Aho-Corasick algorithm, which is used in Snort to compare patterns, in order to show this algorithm's implementation and execution issues, as well as optimization techniques for the Aho-Corasick machine. We have found 147 papers from important computer science publications databases, and have mapped them. We selected 22 and analyzed them in order to find our results. Our analysis of the papers showed, among other results, that parallelization of the AC algorithm is a new task and the authors have focused on the State Transition Table as the most common way to implement the algorithm on the GPU. Furthermore, we found that some techniques speed up the algorithm and reduce the required machine storage space are highly used, such as the algorithm running on the fastest memories and mechanisms for reducing the number of nodes and bit maping.

  2. Fast implementations of 3D PET reconstruction using vector and parallel programming techniques

    International Nuclear Information System (INIS)

    Guerrero, T.M.; Cherry, S.R.; Dahlbom, M.; Ricci, A.R.; Hoffman, E.J.

    1993-01-01

    Computationally intensive techniques that offer potential clinical use have arisen in nuclear medicine. Examples include iterative reconstruction, 3D PET data acquisition and reconstruction, and 3D image volume manipulation including image registration. One obstacle in achieving clinical acceptance of these techniques is the computational time required. This study focuses on methods to reduce the computation time for 3D PET reconstruction through the use of fast computer hardware, vector and parallel programming techniques, and algorithm optimization. The strengths and weaknesses of i860 microprocessor based workstation accelerator boards are investigated in implementations of 3D PET reconstruction

  3. PKA increases in the olfactory bulb act as unconditioned stimuli and provide evidence for parallel memory systems: pairing odor with increased PKA creates intermediate- and long-term, but not short-term, memories.

    Science.gov (United States)

    Grimes, Matthew T; Harley, Carolyn W; Darby-King, Andrea; McLean, John H

    2012-02-21

    Neonatal odor-preference memory in rat pups is a well-defined associative mammalian memory model dependent on cAMP. Previous work from this laboratory demonstrates three phases of neonatal odor-preference memory: short-term (translation-independent), intermediate-term (translation-dependent), and long-term (transcription- and translation-dependent). Here, we use neonatal odor-preference learning to explore the role of olfactory bulb PKA in these three phases of mammalian memory. PKA activity increased normally in learning animals 10 min after a single training trial. Inhibition of PKA by Rp-cAMPs blocked intermediate-term and long-term memory, with no effect on short-term memory. PKA inhibition also prevented learning-associated CREB phosphorylation, a transcription factor implicated in long-term memory. When long-term memory was rescued through increased β-adrenoceptor activation, CREB phosphorylation was restored. Intermediate-term and long-term, but not short-term odor-preference memories were generated by pairing odor with direct PKA activation using intrabulbar Sp-cAMPs, which bypasses β-adrenoceptor activation. Higher levels of Sp-cAMPs enhanced memory by extending normal 24-h retention to 48-72 h. These results suggest that increased bulbar PKA is necessary and sufficient for the induction of intermediate-term and long-term odor-preference memory, and suggest that PKA activation levels also modulate memory duration. However, short-term memory appears to use molecular mechanisms other than the PKA/CREB pathway. These mechanisms, which are also recruited by β-adrenoceptor activation, must operate in parallel with PKA activation.

  4. Flash Memory Reliability: Read, Program, and Erase Latency Versus Endurance Cycling

    Science.gov (United States)

    Heidecker, Jason

    2010-01-01

    This report documents the efforts and results of the fiscal year (FY) 2010 NASA Electronic Parts and Packaging Program (NEPP) task for nonvolatile memory (NVM) reliability. This year's focus was to measure latency (read, program, and erase) of NAND Flash memories and determine how these parameters drift with erase/program/read endurance cycling.

  5. A Parallel Distributed-Memory Particle Method Enables Acquisition-Rate Segmentation of Large Fluorescence Microscopy Images.

    Directory of Open Access Journals (Sweden)

    Yaser Afshar

    Full Text Available Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 10(10 pixels, but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments.

  6. A Parallel Distributed-Memory Particle Method Enables Acquisition-Rate Segmentation of Large Fluorescence Microscopy Images.

    Science.gov (United States)

    Afshar, Yaser; Sbalzarini, Ivo F

    2016-01-01

    Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 10(10) pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments.

  7. Experimental investigations of the large deflection capabilities of a compliant parallel mechanism actuated by shape memory alloy wires

    International Nuclear Information System (INIS)

    Sreekumar, M; Nagarajan, T; Singaperumal, M

    2008-01-01

    This experimental study investigates the coupled effect of the force developed by the shape memory alloy (SMA) actuators and the force required for the large deflection of an elastica member in a compliant parallel mechanism. The compliant mechanism developed in house consists of a moving platform mounted on a superelastic pillar and three SMA wire actuators to manipulate the platform. A three-axis MEMS accelerometer has been mounted on the moving platform to measure its tilt angle. Three miniature force sensors have been designed and fabricated out of cantilever beams, each mounted with a pair of strain gauges, to measure the force developed by the respective actuators. The force sensors are highly sensitive and cost effective compared to commercially available miniature force sensors. Calibration of the force sensors has been accomplished with known weights, and for the three-axis MEMS accelerometer a rotary base has been considered which is usually used in optical applications. The calibration curves obtained, with R-squared values between 0.9997 and 1.0, show that both the tilt and force sensors considered are most appropriate for the respective applications. The mechanism fixed with the sensors and the drivers for the SMA actuators is integrated with a National Instrument's data acquisition system. The experimental results have been compared with the analytical results and it was found that the relative error is less than 2%. This is a preliminary study in the development of a mechanism for eye prosthesis and similar applications

  8. A Parallel Distributed-Memory Particle Method Enables Acquisition-Rate Segmentation of Large Fluorescence Microscopy Images

    Science.gov (United States)

    Afshar, Yaser; Sbalzarini, Ivo F.

    2016-01-01

    Modern fluorescence microscopy modalities, such as light-sheet microscopy, are capable of acquiring large three-dimensional images at high data rate. This creates a bottleneck in computational processing and analysis of the acquired images, as the rate of acquisition outpaces the speed of processing. Moreover, images can be so large that they do not fit the main memory of a single computer. We address both issues by developing a distributed parallel algorithm for segmentation of large fluorescence microscopy images. The method is based on the versatile Discrete Region Competition algorithm, which has previously proven useful in microscopy image segmentation. The present distributed implementation decomposes the input image into smaller sub-images that are distributed across multiple computers. Using network communication, the computers orchestrate the collectively solving of the global segmentation problem. This not only enables segmentation of large images (we test images of up to 1010 pixels), but also accelerates segmentation to match the time scale of image acquisition. Such acquisition-rate image segmentation is a prerequisite for the smart microscopes of the future and enables online data compression and interactive experiments. PMID:27046144

  9. Self-stabilized discharge filament in plane-parallel barrier discharge configuration: formation, breakdown mechanism, and memory effects

    Science.gov (United States)

    Tschiersch, R.; Nemschokmichal, S.; Bogaczyk, M.; Meichsner, J.

    2017-10-01

    Single self-stabilized discharge filaments were investigated in the plane-parallel electrode configuration. The barrier discharge was operated inside a gap of 3 mm shielded by glass plates to both electrodes, using helium-nitrogen mixtures and a square-wave feeding voltage at a frequency of 2 kHz. The combined application of electrical measurements, ICCD camera imaging, optical emission spectroscopy and surface charge diagnostics via the electro-optic Pockels effect allowed the correlation of the discharge development in the volume and on the dielectric surfaces. The formation criteria and existence regimes were found by systematic variation of the nitrogen admixture to helium, the total pressure and the feeding voltage amplitude. Single self-stabilized discharge filaments can be operated over a wide parameter range, foremost, by significant reduction of the voltage amplitude after the operation in the microdischarge regime. Here, the outstanding importance of the surface charge memory effect on the long-term stability was pointed out by the recalculated spatio-temporally resolved gap voltage. The optical emission revealed discharge characteristics that are partially reminiscent of both the glow-like barrier discharge and the microdischarge regime, such as a Townsend pre-phase, a fast cathode-directed ionization front during the breakdown and radially propagating surface discharges during the afterglow.

  10. Searching for globally optimal functional forms for interatomic potentials using genetic programming with parallel tempering.

    Science.gov (United States)

    Slepoy, A; Peters, M D; Thompson, A P

    2007-11-30

    Molecular dynamics and other molecular simulation methods rely on a potential energy function, based only on the relative coordinates of the atomic nuclei. Such a function, called a force field, approximately represents the electronic structure interactions of a condensed matter system. Developing such approximate functions and fitting their parameters remains an arduous, time-consuming process, relying on expert physical intuition. To address this problem, a functional programming methodology was developed that may enable automated discovery of entirely new force-field functional forms, while simultaneously fitting parameter values. The method uses a combination of genetic programming, Metropolis Monte Carlo importance sampling and parallel tempering, to efficiently search a large space of candidate functional forms and parameters. The methodology was tested using a nontrivial problem with a well-defined globally optimal solution: a small set of atomic configurations was generated and the energy of each configuration was calculated using the Lennard-Jones pair potential. Starting with a population of random functions, our fully automated, massively parallel implementation of the method reproducibly discovered the original Lennard-Jones pair potential by searching for several hours on 100 processors, sampling only a minuscule portion of the total search space. This result indicates that, with further improvement, the method may be suitable for unsupervised development of more accurate force fields with completely new functional forms. Copyright (c) 2007 Wiley Periodicals, Inc.

  11. Expressing Parallelism with ROOT

    Energy Technology Data Exchange (ETDEWEB)

    Piparo, D. [CERN; Tejedor, E. [CERN; Guiraud, E. [CERN; Ganis, G. [CERN; Mato, P. [CERN; Moneta, L. [CERN; Valls Pla, X. [CERN; Canal, P. [Fermilab

    2017-11-22

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  12. Expressing Parallelism with ROOT

    Science.gov (United States)

    Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

    2017-10-01

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  13. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets

    Science.gov (United States)

    Shrimankar, D. D.; Sathe, S. R.

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today’s supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures. PMID:27932868

  14. Analysis of Parallel Algorithms on SMP Node and Cluster of Workstations Using Parallel Programming Models with New Tile-based Method for Large Biological Datasets.

    Science.gov (United States)

    Shrimankar, D D; Sathe, S R

    2016-01-01

    Sequence alignment is an important tool for describing the relationships between DNA sequences. Many sequence alignment algorithms exist, differing in efficiency, in their models of the sequences, and in the relationship between sequences. The focus of this study is to obtain an optimal alignment between two sequences of biological data, particularly DNA sequences. The algorithm is discussed with particular emphasis on time, speedup, and efficiency optimizations. Parallel programming presents a number of critical challenges to application developers. Today's supercomputer often consists of clusters of SMP nodes. Programming paradigms such as OpenMP and MPI are used to write parallel codes for such architectures. However, the OpenMP programs cannot be scaled for more than a single SMP node. However, programs written in MPI can have more than single SMP nodes. But such a programming paradigm has an overhead of internode communication. In this work, we explore the tradeoffs between using OpenMP and MPI. We demonstrate that the communication overhead incurs significantly even in OpenMP loop execution and increases with the number of cores participating. We also demonstrate a communication model to approximate the overhead from communication in OpenMP loops. Our results are astonishing and interesting to a large variety of input data files. We have developed our own load balancing and cache optimization technique for message passing model. Our experimental results show that our own developed techniques give optimum performance of our parallel algorithm for various sizes of input parameter, such as sequence size and tile size, on a wide variety of multicore architectures.

  15. Study on MPI/OpenMP hybrid parallelism for Monte Carlo neutron transport code

    International Nuclear Information System (INIS)

    Liang Jingang; Xu Qi; Wang Kan; Liu Shiwen

    2013-01-01

    Parallel programming with mixed mode of messages-passing and shared-memory has several advantages when used in Monte Carlo neutron transport code, such as fitting hardware of distributed-shared clusters, economizing memory demand of Monte Carlo transport, improving parallel performance, and so on. MPI/OpenMP hybrid parallelism was implemented based on a one dimension Monte Carlo neutron transport code. Some critical factors affecting the parallel performance were analyzed and solutions were proposed for several problems such as contention access, lock contention and false sharing. After optimization the code was tested finally. It is shown that the hybrid parallel code can reach good performance just as pure MPI parallel program, while it saves a lot of memory usage at the same time. Therefore hybrid parallel is efficient for achieving large-scale parallel of Monte Carlo neutron transport. (authors)

  16. What is "the patient perspective" in patient engagement programs? Implicit logics and parallels to feminist theories.

    Science.gov (United States)

    Rowland, Paula; McMillan, Sarah; McGillicuddy, Patti; Richards, Joy

    2017-01-01

    Public and patient involvement (PPI) in health care may refer to many different processes, ranging from participating in decision-making about one's own care to participating in health services research, health policy development, or organizational reforms. Across these many forms of public and patient involvement, the conceptual and theoretical underpinnings remain poorly articulated. Instead, most public and patient involvement programs rely on policy initiatives as their conceptual frameworks. This lack of conceptual clarity participates in dilemmas of program design, implementation, and evaluation. This study contributes to the development of theoretical understandings of public and patient involvement. In particular, we focus on the deployment of patient engagement programs within health service organizations. To develop a deeper understanding of the conceptual underpinnings of these programs, we examined the concept of "the patient perspective" as used by patient engagement practitioners and participants. Specifically, we focused on the way this phrase was used in the singular: "the" patient perspective or "the" patient voice. From qualitative analysis of interviews with 20 patient advisers and 6 staff members within a large urban health network in Canada, we argue that "the patient perspective" is referred to as a particular kind of situated knowledge, specifically an embodied knowledge of vulnerability. We draw parallels between this logic of patient perspective and the logic of early feminist theory, including the concepts of standpoint theory and strong objectivity. We suggest that champions of patient engagement may learn much from the way feminist theorists have constructed their arguments and addressed critique.

  17. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    Directory of Open Access Journals (Sweden)

    Matthew O'keefe

    1995-01-01

    Full Text Available Massively parallel processors (MPPs hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how applications codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  18. Drainage network extraction from a high-resolution DEM using parallel programming in the .NET Framework

    Science.gov (United States)

    Du, Chao; Ye, Aizhong; Gan, Yanjun; You, Jinjun; Duan, Qinyun; Ma, Feng; Hou, Jingwen

    2017-12-01

    High-resolution Digital Elevation Models (DEMs) can be used to extract high-accuracy prerequisite drainage networks. A higher resolution represents a larger number of grids. With an increase in the number of grids, the flow direction determination will require substantial computer resources and computing time. Parallel computing is a feasible method with which to resolve this problem. In this paper, we proposed a parallel programming method within the .NET Framework with a C# Compiler in a Windows environment. The basin is divided into sub-basins, and subsequently the different sub-basins operate on multiple threads concurrently to calculate flow directions. The method was applied to calculate the flow direction of the Yellow River basin from 3 arc-second resolution SRTM DEM. Drainage networks were extracted and compared with HydroSHEDS river network to assess their accuracy. The results demonstrate that this method can calculate the flow direction from high-resolution DEMs efficiently and extract high-precision continuous drainage networks.

  19. The language parallel Pascal and other aspects of the massively parallel processor

    Science.gov (United States)

    Reeves, A. P.; Bruner, J. D.

    1982-01-01

    A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

  20. Telephone word-list recall tested in the rural aging and memory study: two parallel versions for the TICS-M.

    Science.gov (United States)

    Hogervorst, Eva; Bandelow, Stephan; Hart, John; Henderson, Victor W

    2004-09-01

    Parallel versions of memory tasks are useful in clinical and research settings to reduce practice effects engendered by multiple administrations. We aimed to investigate the usefulness of three parallel versions of ten-item word list recall tasks administered by telephone. A population based telephone survey of middle-aged and elderly residents of Bradley County, Arkansas was carried out as part of the Rural Aging and Memory Study (RAMS). Participants in the study were 1845 persons aged 40 to 95 years. Word lists included that used in the telephone interview of cognitive status (TICS) as a criterion standard and two newly developed lists. The mean age of participants was 61.05 (SD 12.44) years; 39.5% were over age 65. 78% of the participants had completed high school, 66% were women and 21% were African-American. There was no difference in demographic characteristics between groups receiving different word list versions, and performances on the three versions were equivalent for both immediate (mean 4.22, SD 1.53) and delayed (mean 2.35 SD 1.75) recall trials. The total memory score (immediate+delayed recall) was negatively associated with older age (beta = -0.41, 95%CI=-0.11 to -0.04), lower education (beta = 0.24, 95%CI = 0.36 to 0.51), male gender (beta = -0.18, 95%CI = -1.39 to -0.90) and African-American race (beta = -0.15, 95%CI = -1.41 to -0.82). The two RAMS word recall lists and the TICS word recall list can be used interchangeably in telephone assessment of memory of middle-aged and elderly persons. This finding is important for future studies where parallel versions of a word-list memory task are needed. (250 words).

  1. Parallelizing Gene Expression Programming Algorithm in Enabling Large-Scale Classification

    Directory of Open Access Journals (Sweden)

    Lixiong Xu

    2017-01-01

    Full Text Available As one of the most effective function mining algorithms, Gene Expression Programming (GEP algorithm has been widely used in classification, pattern recognition, prediction, and other research fields. Based on the self-evolution, GEP is able to mine an optimal function for dealing with further complicated tasks. However, in big data researches, GEP encounters low efficiency issue due to its long time mining processes. To improve the efficiency of GEP in big data researches especially for processing large-scale classification tasks, this paper presents a parallelized GEP algorithm using MapReduce computing model. The experimental results show that the presented algorithm is scalable and efficient for processing large-scale classification tasks.

  2. A short executive function training program improves preschoolers’ working memory

    Directory of Open Access Journals (Sweden)

    Emma eBlakey

    2015-11-01

    Full Text Available Cognitive training has been shown to improve executive functions in middle childhood and adulthood. However, fewer studies have targeted the preschool years – a time when executive functions undergo rapid development. The present study tested the effects of a short four session executive function training program in 54 four-year-olds. The training group significantly improved their working memory from pre-training relative to an active control group. Notably, this effect extended to a task sharing few surface features with the trained tasks, and continued to be apparent three months later. In addition, the benefits of training extended to a measure of mathematical reasoning three months later, indicating that training executive functions during the preschool years has the potential to convey benefits that are both long-lasting and wide-ranging.

  3. Chemically programmed ink-jet printed resistive WORM memory array and readout circuit

    International Nuclear Information System (INIS)

    Andersson, H; Manuilskiy, A; Sidén, J; Gao, J; Kunninmel, G V; Nilsson, H-E; Hummelgård, M

    2014-01-01

    In this paper an ink-jet printed write once read many (WORM) resistive memory fabricated on paper substrate is presented. The memory elements are programmed for different resistance states by printing triethylene glycol monoethyl ether on the substrate before the actual memory element is printed using silver nano particle ink. The resistance is thus able to be set to a broad range of values without changing the geometry of the elements. A memory card consisting of 16 elements is manufactured for which the elements are each programmed to one of four defined logic levels, providing a total of 4294 967 296 unique possible combinations. Using a readout circuit, originally developed for resistive sensors to avoid crosstalk between elements, a memory card reader is manufactured that is able to read the values of the memory card and transfer the data to a PC. Such printed memory cards can be used in various applications. (paper)

  4. Bayer image parallel decoding based on GPU

    Science.gov (United States)

    Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua

    2012-11-01

    In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.

  5. Translation techniques for distributed-shared memory programming models

    Energy Technology Data Exchange (ETDEWEB)

    Fuller, Douglas James [Iowa State Univ., Ames, IA (United States)

    2005-01-01

    The high performance computing community has experienced an explosive improvement in distributed-shared memory hardware. Driven by increasing real-world problem complexity, this explosion has ushered in vast numbers of new systems. Each new system presents new challenges to programmers and application developers. Part of the challenge is adapting to new architectures with new performance characteristics. Different vendors release systems with widely varying architectures that perform differently in different situations. Furthermore, since vendors need only provide a single performance number (total MFLOPS, typically for a single benchmark), they only have strong incentive initially to optimize the API of their choice. Consequently, only a fraction of the available APIs are well optimized on most systems. This causes issues porting and writing maintainable software, let alone issues for programmers burdened with mastering each new API as it is released. Also, programmers wishing to use a certain machine must choose their API based on the underlying hardware instead of the application. This thesis argues that a flexible, extensible translator for distributed-shared memory APIs can help address some of these issues. For example, a translator might take as input code in one API and output an equivalent program in another. Such a translator could provide instant porting for applications to new systems that do not support the application's library or language natively. While open-source APIs are abundant, they do not perform optimally everywhere. A translator would also allow performance testing using a single base code translated to a number of different APIs. Most significantly, this type of translator frees programmers to select the most appropriate API for a given application based on the application (and developer) itself instead of the underlying hardware.

  6. Lateralized odor preference training in rat pups reveals an enhanced network response in anterior piriform cortex to olfactory input that parallels extended memory.

    Science.gov (United States)

    Fontaine, Christine J; Harley, Carolyn W; Yuan, Qi

    2013-09-18

    The present study examines synaptic plasticity in the anterior piriform cortex (aPC) using ex vivo slices from rat pups given lateralized odor preference training. In the early odor preference learning model, a brief 10 min training session yields 24 h memory, while four daily sessions yield 48 h memory. Odor preference memory can be lateralized through naris occlusion as the anterior commissure is not yet functional. AMPA receptor-mediated postsynaptic responses in the aPC to lateral olfactory tract input, shown to be enhanced at 24 h, are no longer enhanced 48 h after a single training session. Following four spaced lateralized trials, the AMPA receptor-mediated fEPSP is enhanced in the trained aPC at 48 h. Calcium imaging of aPC pyramidal cells within 48 h revealed decreased firing thresholds in the pyramidal cell network. Thus multiday odor preference training induced increased odor input responsiveness in previously weakly activated aPC cells. These results support the hypothesis that increased synaptic strength in olfactory input networks mediates odor preference memory. The increase in aPC network activation parallels behavioral memory.

  7. Memory

    Science.gov (United States)

    ... it has to decide what is worth remembering. Memory is the process of storing and then remembering this information. There are different types of memory. Short-term memory stores information for a few ...

  8. Memory-Optimized Software Synthesis from Dataflow Program Graphs with Large Size Data Samples

    Directory of Open Access Journals (Sweden)

    Hyunok Oh

    2003-05-01

    Full Text Available In multimedia and graphics applications, data samples of nonprimitive type require significant amount of buffer memory. This paper addresses the problem of minimizing the buffer memory requirement for such applications in embedded software synthesis from graphical dataflow programs based on the synchronous dataflow (SDF model with the given execution order of nodes. We propose a memory minimization technique that separates global memory buffers from local pointer buffers: the global buffers store live data samples and the local buffers store the pointers to the global buffer entries. The proposed algorithm reduces 67% memory for a JPEG encoder, 40% for an H.263 encoder compared with unshared versions, and 22% compared with the previous sharing algorithm for the H.263 encoder. Through extensive buffer sharing optimization, we believe that automatic software synthesis from dataflow program graphs achieves the comparable code quality with the manually optimized code in terms of memory requirement.

  9. Impaired hippocampal acetylcholine release parallels spatial memory deficits in Tg2576 mice subjected to basal forebrain cholinergic degeneration

    DEFF Research Database (Denmark)

    Laursen, Bettina; Mørk, Arne; Plath, Niels

    2013-01-01

    (BFCD) in 3 months old male Tg2576 mice to co-express cholinergic degeneration with Aβ overexpression as these characteristics constitutes key hallmarks of AD. At 9 months, SAP lesioned Tg2576 mice were cognitively impaired in two spatial paradigms addressing working memory and mid to long-term memory...

  10. VHDL-based programming environment for Floating-Gate analog memory cell

    Directory of Open Access Journals (Sweden)

    Carlos Alberto dos Reis Filho

    2005-02-01

    Full Text Available An implementation in CMOS technology of a Floating-Gate Analog Memory Cell and Programming Environment is presented. A digital closed-loop control compares a reference value set by user and the memory output and after cycling, the memory output is updated and the new value stored. The circuit can be used as analog trimming for VLSI applications where mechanical trimming associated with postprocessing chip is prohibitive due to high costs.

  11. Pattern-Driven Automatic Parallelization

    Directory of Open Access Journals (Sweden)

    Christoph W. Kessler

    1996-01-01

    Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.

  12. Modularity, Working Memory, and Second Language Acquisition: A Research Program

    Science.gov (United States)

    Truscott, John

    2017-01-01

    Considerable reason exists to view the mind, and language within it, as modular, and this view has an important place in research and theory in second language acquisition (SLA) and beyond. But it has had very little impact on the study of working memory and its role in SLA. This article considers the need for modular study of working memory,…

  13. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels

    KAUST Repository

    Haidar, Azzam; Ltaief, Hatem; Dongarra, Jack

    2011-01-01

    of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance

  14. SPSS and SAS programs for determining the number of components using parallel analysis and velicer's MAP test.

    Science.gov (United States)

    O'Connor, B P

    2000-08-01

    Popular statistical software packages do not have the proper procedures for determining the number of components in factor and principal components analyses. Parallel analysis and Velicer's minimum average partial (MAP) test are validated procedures, recommended widely by statisticians. However, many researchers continue to use alternative, simpler, but flawed procedures, such as the eigenvalues-greater-than-one rule. Use of the proper procedures might be increased if these procedures could be conducted within familiar software environments. This paper describes brief and efficient programs for using SPSS and SAS to conduct parallel analyses and the MAP test.

  15. Is Monte Carlo embarrassingly parallel?

    Energy Technology Data Exchange (ETDEWEB)

    Hoogenboom, J. E. [Delft Univ. of Technology, Mekelweg 15, 2629 JB Delft (Netherlands); Delft Nuclear Consultancy, IJsselzoom 2, 2902 LB Capelle aan den IJssel (Netherlands)

    2012-07-01

    Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turn out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)

  16. Is Monte Carlo embarrassingly parallel?

    International Nuclear Information System (INIS)

    Hoogenboom, J. E.

    2012-01-01

    Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turn out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)

  17. Parallel sorting algorithms

    CERN Document Server

    Akl, Selim G

    1985-01-01

    Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the

  18. From functional programming to multicore parallelism: A case study based on Presburger Arithmetic

    DEFF Research Database (Denmark)

    Dung, Phan Anh; Hansen, Michael Reichhardt

    2011-01-01

    , we are interested in using PA in connection with the Duration Calculus Model Checker (DCMC) [5]. There are effective decision procedures for PA including Cooper’s algorithm and the Omega Test; however, their complexity is extremely high with doubly exponential lower bound and triply exponential upper...... bound [7]. We investigate these decision procedures in the context of multicore parallelism with the hope of exploiting multicore powers. Unfortunately, we are not aware of any prior parallelism research related to decision procedures for PA. The closest work is the preliminary results on parallelism...

  19. Transactional Memory

    CERN Document Server

    Harris, Tim; Rajwar, Ravi

    2010-01-01

    The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs.This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI(atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically - either it completes successfullyand

  20. Development of whole core thermal-hydraulic analysis program ACT. 4. Simplified fuel assembly model and parallelization by MPI

    International Nuclear Information System (INIS)

    Ohshima, Hiroyuki

    2001-10-01

    A whole core thermal-hydraulic analysis program ACT is being developed for the purpose of evaluating detailed in-core thermal hydraulic phenomena of fast reactors including the effect of the flow between wrapper-tube walls (inter-wrapper flow) under various reactor operation conditions. As appropriate boundary conditions in addition to a detailed modeling of the core are essential for accurate simulations of in-core thermal hydraulics, ACT consists of not only fuel assembly and inter-wrapper flow analysis modules but also a heat transport system analysis module that gives response of the plant dynamics to the core model. This report describes incorporation of a simplified model to the fuel assembly analysis module and program parallelization by a message passing method toward large-scale simulations. ACT has a fuel assembly analysis module which can simulate a whole fuel pin bundle in each fuel assembly of the core and, however, it may take much CPU time for a large-scale core simulation. Therefore, a simplified fuel assembly model that is thermal-hydraulically equivalent to the detailed one has been incorporated in order to save the simulation time and resources. This simplified model is applied to several parts of fuel assemblies in a core where the detailed simulation results are not required. With regard to the program parallelization, the calculation load and the data flow of ACT were analyzed and the optimum parallelization has been done including the improvement of the numerical simulation algorithm of ACT. Message Passing Interface (MPI) is applied to data communication between processes and synchronization in parallel calculations. Parallelized ACT was verified through a comparison simulation with the original one. In addition to the above works, input manuals of the core analysis module and the heat transport system analysis module have been prepared. (author)

  1. Parallel R

    CERN Document Server

    McCallum, Ethan

    2011-01-01

    It's tough to argue with R as a high-quality, cross-platform, open source statistical software product-unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.

  2. A program of positive intervention in the elderly: memories, gratitude and forgiveness.

    Science.gov (United States)

    Ramírez, Encarnación; Ortega, Ana Raquel; Chamorro, Alberto; Colmenero, José María

    2014-05-01

    The main goal of this study has been to increase the quality of life in people of over 60 years through a positive psychology intervention. We employed a program which consists of training based on autobiographical memory, forgiveness and gratitude. The sample consisted of 46 participants aged 60-93 years. State and trait anxiety, depression, general memory, specific memories, life satisfaction and subjective happiness were measured. The results revealed that participants who followed the program (experimental group) showed a significant decrease in state anxiety and depression as well as an increase in specific memories, life satisfaction and subjective happiness, compared with the placebo group. Our program offers promising results and provides new evidence for the effectiveness of positive interventions in the field of psychogerontology, helping increase subjective well-being and quality of life in older adults by focusing interventions on the enhancement of personal and social resources for being happy.

  3. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    Science.gov (United States)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  4. Parallel processing of Monte Carlo code MCNP for particle transport problem

    Energy Technology Data Exchange (ETDEWEB)

    Higuchi, Kenji; Kawasaki, Takuji

    1996-06-01

    It is possible to vectorize or parallelize Monte Carlo codes (MC code) for photon and neutron transport problem, making use of independency of the calculation for each particle. Applicability of existing MC code to parallel processing is mentioned. As for parallel computer, we have used both vector-parallel processor and scalar-parallel processor in performance evaluation. We have made (i) vector-parallel processing of MCNP code on Monte Carlo machine Monte-4 with four vector processors, (ii) parallel processing on Paragon XP/S with 256 processors. In this report we describe the methodology and results for parallel processing on two types of parallel or distributed memory computers. In addition, we mention the evaluation of parallel programming environments for parallel computers used in the present work as a part of the work developing STA (Seamless Thinking Aid) Basic Software. (author)

  5. Parallel Object Oriented MD Simulation Program for Long Time Simulations of Metallic Glasses and Undercooled Liquids

    Science.gov (United States)

    Böddeker, B.; Teichler, H.

    The MD simulation program TABB is motivated by the need of long time simulations for the investigation of slow processes near the glass transition of glass forming alloys. TABB is written in C++ with a high degree of flexibility: TABB allows the use of any short ranged pair potentials or EAM potentials, by generating and using a spline representation of all functions and their derivatives. TABB supports several numerical integration algorithms like the Runge-Kotta or the modified Gear-predictor-corrector algorithm of order five. The boundary conditions can be chosen to resemble the geometry of bulk materials or films. The simulation box length or the pressure can be fixed for each dimension separately. TABB may be used in isokinetic, isoenergeric or canonic (with random forces) mode. TABB contains a simple instruction interpreter to easily control the parameters and options during the simulation. The same source code can be compiled either for workstations or for parallel computers. The main optimization goal of TABB is to allow long time simulations of medium or small sized systems. To make this possible, much attention is spent on the optimized communication between the nodes. TABB uses a domain decomposition procedure. To use many nodes with a small system, the domain size has to be small compared to the range of particle interactions. In the limit of many nodes for only few atoms, the bottle neck of communication is the latency time. TABB minimizes the number of pairs of domains containing atoms that interact between these domains. This procedure minimizes the need of communication calls between pairs of nodes. TABB decides automatically, to how many, and to which directions the decomposition shall be applied. E.g., in the case of one dimensional domain decomposition, the simulation box is only split into "slabs" along a selected direction. The three dimensional domain decomposition is best with respect to the number of interacting domains only for simulations

  6. Computer-Aided Parallelizer and Optimizer

    Science.gov (United States)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives (see figure) to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.

  7. What we remember affects how we see: spatial working memory steers saccade programming.

    Science.gov (United States)

    Wong, Jason H; Peterson, Matthew S

    2013-02-01

    Relationships between visual attention, saccade programming, and visual working memory have been hypothesized for over a decade. Awh, Jonides, and Reuter-Lorenz (Journal of Experimental Psychology: Human Perception and Performance 24(3):780-90, 1998) and Awh et al. (Psychological Science 10(5):433-437, 1999) proposed that rehearsing a location in memory also leads to enhanced attentional processing at that location. In regard to eye movements, Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009) found that holding a location in working memory affects saccade programming, albeit negatively. In three experiments, we attempted to replicate the findings of Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009) and determine whether the spatial memory effect can occur in other saccade-cuing paradigms, including endogenous central arrow cues and exogenous irrelevant singletons. In the first experiment, our results were the opposite of those in Belopolsky and Theeuwes (Attention, Perception & Psychophysics 71(3):620-631, 2009), in that we found facilitation (shorter saccade latencies) instead of inhibition when the saccade target matched the region in spatial working memory. In Experiment 2, we sought to determine whether the spatial working memory effect would generalize to other endogenous cuing tasks, such as a central arrow that pointed to one of six possible peripheral locations. As in Experiment 1, we found that saccade programming was facilitated when the cued location coincided with the saccade target. In Experiment 3, we explored how spatial memory interacts with other types of cues, such as a peripheral color singleton target or irrelevant onset. In both cases, the eyes were more likely to go to either singleton when it coincided with the location held in spatial working memory. On the basis of these results, we conclude that spatial working memory and saccade programming are likely to share common

  8. Adapting algorithms to massively parallel hardware

    CERN Document Server

    Sioulas, Panagiotis

    2016-01-01

    In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.

  9. Honoring our donors: a survey of memorial ceremonies in United States anatomy programs.

    Science.gov (United States)

    Jones, Trahern W; Lachman, Nirusha; Pawlina, Wojciech

    2014-01-01

    Many anatomy programs that incorporate dissection of donated human bodies hold memorial ceremonies of gratitude towards body donors. The content of these ceremonies may include learners' reflections on mortality, respect, altruism, and personal growth told through various humanities modalities. The task of planning is usually student- and faculty-led with participation from other health care students. Objective information on current memorial ceremonies for body donors in anatomy programs in the United States appears to be lacking. The number of programs in the United States that currently plan these memorial ceremonies and information on trends in programs undertaking such ceremonies remain unknown. Gross anatomy program directors throughout the United States were contacted and asked to respond to a voluntary questionnaire on memorial ceremonies held at their institution. The results (response rate 68.2%) indicated that a majority of human anatomy programs (95.5%) hold memorial ceremonies. These ceremonies are, for the most part, student-driven and nondenominational or secular in nature. Participants heavily rely upon speech, music, poetry, and written essays, with a small inclusion of other humanities modalities, such as dance or visual art, to explore a variety of themes during these ceremonies. © 2013 American Association of Anatomists.

  10. Distributed-memory matrix computations

    DEFF Research Database (Denmark)

    Balle, Susanne Mølleskov

    1995-01-01

    The main goal of this project is to investigate, develop, and implement algorithms for numerical linear algebra on parallel computers in order to acquire expertise in methods for parallel computations. An important motivation for analyzaing and investigating the potential for parallelism in these......The main goal of this project is to investigate, develop, and implement algorithms for numerical linear algebra on parallel computers in order to acquire expertise in methods for parallel computations. An important motivation for analyzaing and investigating the potential for parallelism...... in these algorithms is that many scientific applications rely heavily on the performance of the involved dense linear algebra building blocks. Even though we consider the distributed-memory as well as the shared-memory programming paradigm, the major part of the thesis is dedicated to distributed-memory architectures....... We emphasize distributed-memory massively parallel computers - such as the Connection Machines model CM-200 and model CM-5/CM-5E - available to us at UNI-C and at Thinking Machines Corporation. The CM-200 was at the time this project started one of the few existing massively parallel computers...

  11. The Garrett Lee Smith Memorial Suicide Prevention Program

    Science.gov (United States)

    Goldston, David B.; Walrath, Christine M.; McKeon, Richard; Puddy, Richard W.; Lubell, Keri M.; Potter, Lloyd B.; Rodi, Michael S.

    2010-01-01

    In response to calls for greater efforts to reduce youth suicide, the Garrett Lee Smith (GLS) Memorial Act has provided funding for 68 state, territory, and tribal community grants, and 74 college campus grants for suicide prevention efforts. Suicide prevention activities supported by GLS grantees have included education, training programs…

  12. Automatically Proving Termination and Memory Safety for Programs with Pointer Arithmetic

    DEFF Research Database (Denmark)

    Ströder, Thomas; Giesl, Jürgen; Brockschmidt, Marc

    2017-01-01

    While automated verification of imperative programs has been studied intensively, proving termination of programs with explicit pointer arithmetic fully automatically was still an open problem. To close this gap, we introduce a novel abstract domain that can track allocated memory in detail. We use...

  13. Identifying Inter-task Communication in Shared Memory Programming Models

    DEFF Research Database (Denmark)

    Larsen, Per; Karlsson, Sven; Madsen, Jan

    2009-01-01

    as a set of runtime operations which are necessary to enforce declarations made by the programmer. The cost and scalability of the runtime operations are evaluated using micro-benchmarks and a benchmark from the NAS parallel benchmark suite. The measurements show that the overhead of the runtime operations...

  14. Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

    OpenAIRE

    Mathuriya, Amrita; Luo, Ye; Benali, Anouar; Shulenburger, Luke; Kim, Jeongnim

    2016-01-01

    B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply data layout transfo...

  15. Cognitive Training Program to Improve Working Memory in Older Adults with MCI.

    Science.gov (United States)

    Hyer, Lee; Scott, Ciera; Atkinson, Mary Michael; Mullen, Christine M; Lee, Anna; Johnson, Aaron; Mckenzie, Laura C

    2016-01-01

    Deficits in working memory (WM) are associated with age-related decline. We report findings from a clinical trial that examined the effectiveness of Cogmed, a computerized program that trains WM. We compare this program to a Sham condition in older adults with Mild Cognitive Impairment (MCI). Older adults (N = 68) living in the community were assessed. Participants reported memory impairment and met criteria for MCI, either by poor delayed memory or poor performance in other cognitive areas. The Repeatable Battery for the Assessment of Neuropsychological Status (RBANS, Delayed Memory Index) and the Clinical Dementia Rating scale (CDR) were utilized. All presented with normal Mini Mental State Exams (MMSE) and activities of daily living (ADLs). Participants were randomized to Cogmed or a Sham computer program. Twenty-five sessions were completed over five to seven weeks. Pre, post, and follow-up measures included a battery of cognitive measures (three WM tests), a subjective memory scale, and a functional measure. Both intervention groups improved over time. Cogmed significantly outperformed Sham on Span Board and exceeded in subjective memory reports at follow-up as assessed by the Cognitive Failures Questionnaire (CFQ). The Cogmed group demonstrated better performance on the Functional Activities Questionnaire (FAQ), a measure of adjustment and far transfer, at follow-up. Both groups, especially Cogmed, enjoyed the intervention. Results suggest that WM was enhanced in both groups of older adults with MCI. Cogmed was better on one core WM measure and had higher ratings of satisfaction. The Sham condition declined on adjustment.

  16. CUDA/GPU Technology : Parallel Programming For High Performance Scientific Computing

    OpenAIRE

    YUHENDRA; KUZE, Hiroaki; JOSAPHAT, Tetuko Sri Sumantyo

    2009-01-01

    [ABSTRACT]Graphics processing units (GP Us) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. In the high performance computation capabilities, graphic processing units (GPU) lead to much more powerful performance than conventional CPUs by means of parallel processing. In 2007, the birth of Compute Unified Device Architecture (CUDA) and CUDA-enabled GPUs by NVIDIA Corporation brought a revolution in the general purpose GPU a...

  17. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    Science.gov (United States)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks number of processors

  18. Abstract Level Parallelization of Finite Difference Methods

    Directory of Open Access Journals (Sweden)

    Edwin Vollebregt

    1997-01-01

    Full Text Available A formalism is proposed for describing finite difference calculations in an abstract way. The formalism consists of index sets and stencils, for characterizing the structure of sets of data items and interactions between data items (“neighbouring relations”. The formalism provides a means for lifting programming to a more abstract level. This simplifies the tasks of performance analysis and verification of correctness, and opens the way for automaticcode generation. The notation is particularly useful in parallelization, for the systematic construction of parallel programs in a process/channel programming paradigm (e.g., message passing. This is important because message passing, unfortunately, still is the only approach that leads to acceptable performance for many more unstructured or irregular problems on parallel computers that have non-uniform memory access times. It will be shown that the use of index sets and stencils greatly simplifies the determination of which data must be exchanged between different computing processes.

  19. Virtual reality-based prospective memory training program for people with acquired brain injury.

    Science.gov (United States)

    Yip, Ben C B; Man, David W K

    2013-01-01

    Acquired brain injuries (ABI) may display cognitive impairments and lead to long-term disabilities including prospective memory (PM) failure. Prospective memory serves to remember to execute an intended action in the future. PM problems would be a challenge to an ABI patient's successful community reintegration. While retrospective memory (RM) has been extensively studied, treatment programs for prospective memory are rarely reported. The development of a treatment program for PM, which is considered timely, can be cost-effective and appropriate to the patient's environment. A 12-session virtual reality (VR)-based cognitive rehabilitation program was developed using everyday PM activities as training content. 37 subjects were recruited to participate in a pretest-posttest control experimental study to evaluate its treatment effectiveness. Results suggest that significantly better changes were seen in both VR-based and real-life PM outcome measures, related cognitive attributes such as frontal lobe functions and semantic fluency. VR-based training may be well accepted by ABI patients as encouraging improvement has been shown. Large-scale studies of a virtual reality-based prospective memory (VRPM) training program are indicated.

  20. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels

    KAUST Repository

    Haidar, Azzam

    2011-01-01

    This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application\\'s overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000×24000. Copyright 2011 ACM.

  1. Application Portable Parallel Library

    Science.gov (United States)

    Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott

    1995-01-01

    Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.

  2. High Performance Programming Using Explicit Shared Memory Model on the Cray T3D

    Science.gov (United States)

    Saini, Subhash; Simon, Horst D.; Lasinski, T. A. (Technical Monitor)

    1994-01-01

    The Cray T3D is the first-phase system in Cray Research Inc.'s (CRI) three-phase massively parallel processing program. In this report we describe the architecture of the T3D, as well as the CRAFT (Cray Research Adaptive Fortran) programming model, and contrast it with PVM, which is also supported on the T3D We present some performance data based on the NAS Parallel Benchmarks to illustrate both architectural and software features of the T3D.

  3. Parallel implementation and evaluation of motion estimation system algorithms on a distributed memory multiprocessor using knowledge based mappings

    Science.gov (United States)

    Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.

    1989-01-01

    Several techniques to perform static and dynamic load balancing techniques for vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.

  4. Memory

    OpenAIRE

    Wager, Nadia

    2017-01-01

    This chapter will explore a response to traumatic victimisation which has divided the opinions of psychologists at an exponential rate. We will be examining amnesia for memories of childhood sexual abuse and the potential to recover these memories in adulthood. Whilst this phenomenon is generally accepted in clinical circles, it is seen as highly contentious amongst research psychologists, particularly experimental cognitive psychologists. The chapter will begin with a real case study of a wo...

  5. System of programming units for the K556RT4 and K556RT5 fixed programmed memory devices

    International Nuclear Information System (INIS)

    Bobkov, S.G.; Ermolin, Yu.V.; Kantserov, V.A.; Strigin, V.B.

    1983-01-01

    The programming system of constant programmable memory devices K556RT4 and K556RT5 that consist of two units (a programming device and an electrothermotraining unit) is described. The modules are made in the KAMAK standard. The programming device takes up 2 normal places, while the electrothermotraining block takes up 1 place. As information recording is done using a computer the time for programming is reduced and the possibility of errors is limited as compared with the manual method. The computer introduces the whole word to be recorded, not the separate parts, in the programming device. The transition to a new digit of a given word in the programming device is done automatically. This reduces the expense of computer time and accelerates the programming of microdiagrams

  6. Logical inference techniques for loop parallelization

    DEFF Research Database (Denmark)

    Oancea, Cosmin Eugen; Rauchwerger, Lawrence

    2012-01-01

    the parallelization transformation by verifying the independence of the loop's memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S={}, where S is a set expression representing array indexes. Using...... of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers....

  7. A parallel buffer tree

    DEFF Research Database (Denmark)

    Sitchinava, Nodar; Zeh, Norbert

    2012-01-01

    We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of...... in the optimal OhOf(psortN + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and psortN is the parallel I/O complexity of sorting N elements using P processors....

  8. Guidance system operations plan for manned CSM earth orbital and lunar missions using program COLOSSUS 3. Section 7: Erasable memory programs

    Science.gov (United States)

    Hamilton, M. H.

    1972-01-01

    Erasable-memory programs designed for guidance computers used in command and lunar modules are presented. The purpose, functional description, assumptions, restrictions, and imitations are given for each program.

  9. Neurite, a finite difference large scale parallel program for the simulation of electrical signal propagation in neurites under mechanical loading.

    Directory of Open Access Journals (Sweden)

    Julián A García-Grajales

    Full Text Available With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally only characterized by purely mechanistic criteria, functions of quantities such as stress, strain or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has been rarely explored. In particular, a quantitative mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computational expensive calculations that increase in complexity as the network of the simulated cells grows. The solvers implemented in Neurite--explicit and implicit--were therefore parallelized using graphics processing units in order to reduce the burden of the simulation costs of large scale scenarios. Cable Theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon

  10. Vectorization and parallelization of Monte-Carlo programs for calculation of radiation transport

    International Nuclear Information System (INIS)

    Seidel, R.

    1995-01-01

    The versatile MCNP-3B Monte-Carlo code written in FORTRAN77, for simulation of the radiation transport of neutral particles, has been subjected to vectorization and parallelization of essential parts, without touching its versatility. Vectorization is not dependent on a specific computer. Several sample tasks have been selected in order to test the vectorized MCNP-3B code in comparison to the scalar MNCP-3B code. The samples are a representative example of the 3-D calculations to be performed for simulation of radiation transport in neutron and reactor physics. (1) 4πneutron detector. (2) High-energy calorimeter. (3) PROTEUS benchmark (conversion rates and neutron multiplication factors for the HCLWR (High Conversion Light Water Reactor)). (orig./HP) [de

  11. Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

    Science.gov (United States)

    Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

    1999-01-01

    The ability to dynamically adapt an unstructured -rid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the view point of portability on various multiprocessor platforms We address this problem by developing PLUM, tin automatic anti architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with a goal to providing a global view of system loads across processors. Experiments on, an SP2 and an Origin2000 demonstrate the portability of our approach which achieves superb load balance at the cost of minimal extra overhead.

  12. Parallelization and automatic data distribution for nuclear reactor simulations

    Energy Technology Data Exchange (ETDEWEB)

    Liebrock, L.M. [Liebrock-Hicks Research, Calumet, MI (United States)

    1997-07-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.

  13. Parallelization and automatic data distribution for nuclear reactor simulations

    International Nuclear Information System (INIS)

    Liebrock, L.M.

    1997-01-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high performance workstations. Even the fastest sequential machine can not run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed of light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed

  14. Past, Present and Future. Dull Knife Memorial College (Indian Action Program Inc.).

    Science.gov (United States)

    1978

    Five vocational training programs as well as academic coursework are offered on the Northern Cheyenne Reservation by Dull Knife Memorial College. Established and operated by the Northern Cheyenne, and located in Lame Deer, Montana, the college was chartered by a tribal ordinance in 1975. Approximately 75 trainees are currently involved in the…

  15. An Evaluation of a Teacher Training Program at the United States Holocaust Memorial Museum

    Science.gov (United States)

    DeBerry, LaMonnia Edge

    2015-01-01

    The purpose of this mixed methods study was to explore the effects of the United States Holocaust Memorial Museum's work in partnering with professors from universities across the United States during a 1-year collaborative partnership through an educational program referred to as Belfer First Step Holocaust Institute for Teacher Educators (BFS…

  16. Practical parallel computing

    CERN Document Server

    Morse, H Stephen

    1994-01-01

    Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi

  17. Operational Semantics of a Weak Memory Model inspired by Go

    OpenAIRE

    Fava, Daniel Schnetzer; Stolz, Volker; Valle, Stian

    2017-01-01

    A memory model dictates which values may be returned when reading from memory. In a parallel computing setting, the memory model affects how processes communicate through shared memory. The design of a proper memory model is a balancing act. On one hand, memory models must be lax enough to allow common hardware and compiler optimizations. On the other, the more lax the model, the harder it is for developers to reason about their programs. In order to alleviate the burden on programmers, a wea...

  18. Can We Efficiently Check Concurrent Programs Under Relaxed Memory Models in Maude?

    DEFF Research Database (Denmark)

    Arrahman, Yehia Abd; Andric, Marina; Beggiato, Alessandro

    2014-01-01

    to the state space explosion. Several techniques have been proposed to mitigate those problems so to make verification under relaxed memory models feasible. We discuss how to adopt some of those techniques in a Maude-based approach to language prototyping, and suggest the use of other techniques that have been......Relaxed memory models offer suitable abstractions of the actual optimizations offered by multi-core architectures and by compilers of concurrent programming languages. Using such abstractions for verification purposes is challenging in part due to their inherent non-determinism which contributes...

  19. Memories.

    Science.gov (United States)

    Brand, Judith, Ed.

    1998-01-01

    This theme issue of the journal "Exploring" covers the topic of "memories" and describes an exhibition at San Francisco's Exploratorium that ran from May 22, 1998 through January 1999 and that contained over 40 hands-on exhibits, demonstrations, artworks, images, sounds, smells, and tastes that demonstrated and depicted the biological,…

  20. Silver Memories: implementation and evaluation of a unique radio program for older people.

    Science.gov (United States)

    Travers, Catherine; Bartlett, Helen P

    2011-03-01

    A unique radio program, Silver Memories, specifically designed to address social isolation and loneliness in older people by broadcasting music (primarily), serials and other programs relevant to the period when older people grew up--the 1920-1950s--first aired in Brisbane, Australia, in April 2008. The impact of the program upon older listeners' mood, quality of life (QOL) and self-reported loneliness was independently evaluated. One hundred and thirteen community-dwelling persons and residents of residential care facilities, aged 60 years and older participated in a three month evaluation of Silver Memories. They were asked to listen to the program daily and baseline and follow-up measures of depression, QOL and loneliness were obtained. Participants were also asked for their opinions regarding the program's quality and appeal. The results showed a statistically significant improvement in measures of depression and QOL from baseline to follow-up but there was no change on the measure of loneliness. The results did not vary by living situation (community vs. residential care), whether the participant was lonely or not lonely, socially isolated or not isolated, or whether there had been any important changes in the participant's health or social circumstances throughout the evaluation. It was concluded that listening to Silver Memories appears to improve the QOL and mood of older people and is an inexpensive intervention that is flexible and readily implemented.

  1. Parallel computation

    International Nuclear Information System (INIS)

    Jejcic, A.; Maillard, J.; Maurel, G.; Silva, J.; Wolff-Bacha, F.

    1997-01-01

    The work in the field of parallel processing has developed as research activities using several numerical Monte Carlo simulations related to basic or applied current problems of nuclear and particle physics. For the applications utilizing the GEANT code development or improvement works were done on parts simulating low energy physical phenomena like radiation, transport and interaction. The problem of actinide burning by means of accelerators was approached using a simulation with the GEANT code. A program of neutron tracking in the range of low energies up to the thermal region has been developed. It is coupled to the GEANT code and permits in a single pass the simulation of a hybrid reactor core receiving a proton burst. Other works in this field refers to simulations for nuclear medicine applications like, for instance, development of biological probes, evaluation and characterization of the gamma cameras (collimators, crystal thickness) as well as the method for dosimetric calculations. Particularly, these calculations are suited for a geometrical parallelization approach especially adapted to parallel machines of the TN310 type. Other works mentioned in the same field refer to simulation of the electron channelling in crystals and simulation of the beam-beam interaction effect in colliders. The GEANT code was also used to simulate the operation of germanium detectors designed for natural and artificial radioactivity monitoring of environment

  2. Comparative Study of Dynamic Programming and Pontryagin’s Minimum Principle on Energy Management for a Parallel Hybrid Electric Vehicle

    Directory of Open Access Journals (Sweden)

    Huei Peng

    2013-04-01

    Full Text Available This paper compares two optimal energy management methods for parallel hybrid electric vehicles using an Automatic Manual Transmission (AMT. A control-oriented model of the powertrain and vehicle dynamics is built first. The energy management is formulated as a typical optimal control problem to trade off the fuel consumption and gear shifting frequency under admissible constraints. The Dynamic Programming (DP and Pontryagin’s Minimum Principle (PMP are applied to obtain the optimal solutions. Tuning with the appropriate co-states, the PMP solution is found to be very close to that from DP. The solution for the gear shifting in PMP has an algebraic expression associated with the vehicular velocity and can be implemented more efficiently in the control algorithm. The computation time of PMP is significantly less than DP.

  3. The Effect of Programmed Physical Exercise to Attention and Working Memory Score in Medical Students

    Directory of Open Access Journals (Sweden)

    Kevin Fachri Muhammad

    2015-06-01

    Full Text Available Background: Attention and working memory are two cognitive domain crucial for activities of daily living. Physical exercise increases the level of BDNF, IGF-1, and VEGF which contributes in attention and working memory processes.This study was conducted to analyze improvement of attention and working memory after programmed physical exercise of Pendidikan Dasar XXI Atlas Medical Pioneer (Pendas XXI AMP. Methods: An analytic observational study was conducted on 47 students from Faculty of Medicine, Universitas Padjadjaran during September-November 2012. Attention was assessed using digit span backward test, stroop test, visual search task, and trail making test. Working memory was assessed using digit span forward test and digit symbol test. Assessment was done on the 11th and 19th week of Pendas XXI AMP. Data distribution was tested first using a test of normality, and then analyzed using T-Dependent Test and Wilcoxon Test Results: Significant improvement was noted for attention in males based on working time for stroop test (26.50±5.66 to 22.03±3.78 seconds, working memory in males based on digit symbol test score (43.96±6.14 to 53.36±5.26 points, attention in females based on reaction time of visual search task for target absent (0.92±0.07 to 0.87±0.07 seconds, and working memory in females based on digit span forward score (5.42±1.30 to 6.63±1.07 points and digit symbol test score (42.47±5.95 to 53.84±5.33 points. Conclusions: Exercise in Pendas XXI AMP improves attention and working memory for college students in Faculty of Medicine Universitas Padjadjaran.

  4. Early programming and late-acting checkpoints governing the development of CD4 T cell memory.

    Science.gov (United States)

    Dhume, Kunal; McKinstry, K Kai

    2018-04-27

    CD4 T cells contribute to protection against pathogens through numerous mechanisms. Incorporating the goal of memory CD4 T cell generation into vaccine strategies thus offers a powerful approach to improve their efficacy, especially in situations where humoral responses alone cannot confer long-term immunity. These threats include viruses such as influenza that mutate coat proteins to avoid neutralizing antibodies, but that are targeted by T cells that recognize more conserved protein epitopes shared by different strains. A major barrier in the design of such vaccines is that the mechanisms controlling the efficiency with which memory cells form remain incompletely understood. Here, we discuss recent insights into fate decisions controlling memory generation. We focus on the importance of three general cues: interleukin-2, antigen, and costimulatory interactions. It is increasingly clear that these signals have a powerful influence on the capacity of CD4 T cells to form memory during two distinct phases of the immune response. First, through 'programming' that occurs during initial priming, and second, through 'checkpoints' that operate later during the effector stage. These findings indicate that novel vaccine strategies must seek to optimize cognate interactions, during which interleukin-2-, antigen, and costimulation-dependent signals are tightly linked, well beyond initial antigen encounter to induce robust memory CD4 T cells. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.

  5. Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

    KAUST Repository

    Belli, Roberto; Hoefler, Torsten

    2015-01-01

    Remote Memory Access (RMA) programming enables direct access to low-level hardware features to achieve high performance for distributed-memory programs. However, the design of RMA programming schemes focuses on the memory access and less on the synchronization. For example, in contemporary RMA programming systems, the widely used producer-consumer pattern can only be implemented inefficiently, incurring in an overhead of an additional round-trip message. We propose Notified Access, a scheme where the target process of an access can receive a completion notification. This scheme enables direct and efficient synchronization with a minimum number of messages. We implement our scheme in an open source MPI-3 RMA library and demonstrate lower overheads (two cache misses) than other point-to-point synchronization mechanisms for each notification. We also evaluate our implementation on three real-world benchmarks, a stencil computation, a tree computation, and a Colicky factorization implemented with tasks. Our scheme always performs better than traditional message passing and other existing RMA synchronization schemes, providing up to 50% speedup on small messages. Our analysis shows that Notified Access is a valuable primitive for any RMA system. Furthermore, we provide guidance for the design of low-level network interfaces to support Notified Access efficiently.

  6. Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

    KAUST Repository

    Belli, Roberto

    2015-05-01

    Remote Memory Access (RMA) programming enables direct access to low-level hardware features to achieve high performance for distributed-memory programs. However, the design of RMA programming schemes focuses on the memory access and less on the synchronization. For example, in contemporary RMA programming systems, the widely used producer-consumer pattern can only be implemented inefficiently, incurring in an overhead of an additional round-trip message. We propose Notified Access, a scheme where the target process of an access can receive a completion notification. This scheme enables direct and efficient synchronization with a minimum number of messages. We implement our scheme in an open source MPI-3 RMA library and demonstrate lower overheads (two cache misses) than other point-to-point synchronization mechanisms for each notification. We also evaluate our implementation on three real-world benchmarks, a stencil computation, a tree computation, and a Colicky factorization implemented with tasks. Our scheme always performs better than traditional message passing and other existing RMA synchronization schemes, providing up to 50% speedup on small messages. Our analysis shows that Notified Access is a valuable primitive for any RMA system. Furthermore, we provide guidance for the design of low-level network interfaces to support Notified Access efficiently.

  7. A database for on-line event analysis on a distributed memory machine

    CERN Document Server

    Argante, E; Van der Stok, P D V; Willers, Ian Malcolm

    1995-01-01

    Parallel in-memory databases can enhance the structuring and parallelization of programs used in High Energy Physics (HEP). Efficient database access routines are used as communication primitives which hide the communication topology in contrast to the more explicit communications like PVM or MPI. A parallel in-memory database, called SPIDER, has been implemented on a 32 node Meiko CS-2 distributed memory machine. The spider primitives generate a lower overhead than the one generated by PVM or PMI. The event reconstruction program, CPREAD of the CPLEAR experiment, has been used as a test case. Performance measurerate generated by CPLEAR.

  8. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

    Directory of Open Access Journals (Sweden)

    Robert Gerstenberger

    2014-01-01

    Full Text Available Modern interconnects offer remote direct memory access (RDMA features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly, however, it's scalability and practicability has to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth and message rate. We also demonstrate application performance improvements with comparable programming complexity.

  9. Benefits of a Classroom Based Instrumental Music Program on Verbal Memory of Primary School Children: A Longitudinal Study

    Science.gov (United States)

    Rickard, Nikki S.; Vasquez, Jorge T.; Murphy, Fintan; Gill, Anneliese; Toukhsati, Samia R.

    2010-01-01

    Previous research has demonstrated a benefit of music training on a number of cognitive functions including verbal memory performance. The impact of school-based music programs on memory processes is however relatively unknown. The current study explored the effect of increasing frequency and intensity of classroom-based instrumental training…

  10. Understanding Notional Machines through Traditional Teaching with Conceptual Contraposition and Program Memory Tracing

    Directory of Open Access Journals (Sweden)

    Jeisson Hidalgo-Céspedes

    2016-08-01

    Full Text Available A correct understanding about how computers run code is mandatory in order to effectively learn to program. Lectures have historically been used in programming courses to teach how computers execute code, and students are assessed through traditional evaluation methods, such as exams. Constructivism learning theory objects to students’ passiveness during lessons, and traditional quantitative methods for evaluating a complex cognitive process such as understanding. Constructivism proposes complimentary techniques, such as conceptual contraposition and colloquies. We enriched lectures of a “Programming II” (CS2 course combining conceptual contraposition with program memory tracing, then we evaluated students’ understanding of programming concepts through colloquies. Results revealed that these techniques applied to the lecture are insufficient to help students develop satisfactory mental models of the C++ notional machine, and colloquies behaved as the most comprehensive traditional evaluations conducted in the course.

  11. Expressing Coarse-Grain Dependencies Among Tasks in Shared Memory Programs

    DEFF Research Database (Denmark)

    Larsen, Per; Karlsson, Sven; Madsen, Jan

    2011-01-01

    Designers of embedded systems face tight constraints on resources, response time and cost. The ability to analyze embedded systems is essential to timely delivery of new designs. Many analysis techniques model parallel programs as task graphs. Task graphs capture the worst-case execution times....... The overhead of verifying the correct use of the directives was measured on a set of benchmarks on two platforms. The overhead of runtime checks was found to be negligible in all instances....

  12. Getting To Exascale: Applying Novel Parallel Programming Models To Lab Applications For The Next Generation Of Supercomputers

    Energy Technology Data Exchange (ETDEWEB)

    Dube, Evi [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Shereda, Charles [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Nau, Lee [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Harris, Lance [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)

    2010-09-27

    As supercomputing moves toward exascale, node architectures will change significantly. CPU core counts on nodes will increase by an order of magnitude or more. Heterogeneous architectures will become more commonplace, with GPUs or FPGAs providing additional computational power. Novel programming models may make better use of on-node parallelism in these new architectures than do current models. In this paper we examine several of these novel models – UPC, CUDA, and OpenCL –to determine their suitability to LLNL scientific application codes. Our study consisted of several phases: We conducted interviews with code teams and selected two codes to port; We learned how to program in the new models and ported the codes; We debugged and tuned the ported applications; We measured results, and documented our findings. We conclude that UPC is a challenge for porting code, Berkeley UPC is not very robust, and UPC is not suitable as a general alternative to OpenMP for a number of reasons. CUDA is well supported and robust but is a proprietary NVIDIA standard, while OpenCL is an open standard. Both are well suited to a specific set of application problems that can be run on GPUs, but some problems are not suited to GPUs. Further study of the landscape of novel models is recommended.

  13. Introducing PROFESS 2.0: A parallelized, fully linear scaling program for orbital-free density functional theory calculations

    Science.gov (United States)

    Hung, Linda; Huang, Chen; Shin, Ilgyou; Ho, Gregory S.; Lignères, Vincent L.; Carter, Emily A.

    2010-12-01

    Orbital-free density functional theory (OFDFT) is a first principles quantum mechanics method to find the ground-state energy of a system by variationally minimizing with respect to the electron density. No orbitals are used in the evaluation of the kinetic energy (unlike Kohn-Sham DFT), and the method scales nearly linearly with the size of the system. The PRinceton Orbital-Free Electronic Structure Software (PROFESS) uses OFDFT to model materials from the atomic scale to the mesoscale. This new version of PROFESS allows the study of larger systems with two significant changes: PROFESS is now parallelized, and the ion-electron and ion-ion terms scale quasilinearly, instead of quadratically as in PROFESS v1 (L. Hung and E.A. Carter, Chem. Phys. Lett. 475 (2009) 163). At the start of a run, PROFESS reads the various input files that describe the geometry of the system (ion positions and cell dimensions), the type of elements (defined by electron-ion pseudopotentials), the actions you want it to perform (minimize with respect to electron density and/or ion positions and/or cell lattice vectors), and the various options for the computation (such as which functionals you want it to use). Based on these inputs, PROFESS sets up a computation and performs the appropriate optimizations. Energies, forces, stresses, material geometries, and electron density configurations are some of the values that can be output throughout the optimization. New version program summaryProgram Title: PROFESS Catalogue identifier: AEBN_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEBN_v2_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 68 721 No. of bytes in distributed program, including test data, etc.: 1 708 547 Distribution format: tar.gz Programming language: Fortran 90 Computer

  14. Micro-mechanical Simulations of Soils using Massively Parallel Supercomputers

    Directory of Open Access Journals (Sweden)

    David W. Washington

    2004-06-01

    Full Text Available In this research a computer program, Trubal version 1.51, based on the Discrete Element Method was converted to run on a Connection Machine (CM-5,a massively parallel supercomputer with 512 nodes, to expedite the computational times of simulating Geotechnical boundary value problems. The dynamic memory algorithm in Trubal program did not perform efficiently in CM-2 machine with the Single Instruction Multiple Data (SIMD architecture. This was due to the communication overhead involving global array reductions, global array broadcast and random data movement. Therefore, a dynamic memory algorithm in Trubal program was converted to a static memory arrangement and Trubal program was successfully converted to run on CM-5 machines. The converted program was called "TRUBAL for Parallel Machines (TPM." Simulating two physical triaxial experiments and comparing simulation results with Trubal simulations validated the TPM program. With a 512 nodes CM-5 machine TPM produced a nine-fold speedup demonstrating the inherent parallelism within algorithms based on the Discrete Element Method.

  15. Parallel models of associative memory

    CERN Document Server

    Hinton, Geoffrey E

    2014-01-01

    This update of the 1981 classic on neural networks includes new commentaries by the authors that show how the original ideas are related to subsequent developments. As researchers continue to uncover ways of applying the complex information processing abilities of neural networks, they give these models an exciting future which may well involve revolutionary developments in understanding the brain and the mind -- developments that may allow researchers to build adaptive intelligent machines. The original chapters show where the ideas came from and the new commentaries show where they are going

  16. Effects of a Memory and Visual-Motor Integration Program for Older Adults Based on Self-Efficacy Theory.

    Science.gov (United States)

    Kim, Eun Hwi; Suh, Soon Rim

    2017-06-01

    This study was conducted to verify the effects of a memory and visual-motor integration program for older adults based on self-efficacy theory. A non-equivalent control group pretest-posttest design was implemented in this quasi-experimental study. The participants were 62 older adults from senior centers and older adult welfare facilities in D and G city (Experimental group=30, Control group=32). The experimental group took part in a 12-session memory and visual-motor integration program over 6 weeks. Data regarding memory self-efficacy, memory, visual-motor integration, and depression were collected from July to October of 2014 and analyzed with independent t-test and Mann-Whitney U test using PASW Statistics (SPSS) 18.0 to determine the effects of the interventions. Memory self-efficacy (t=2.20, p=.031), memory (Z=-2.92, p=.004), and visual-motor integration (Z=-2.49, p=.013) increased significantly in the experimental group as compared to the control group. However, depression (Z=-0.90, p=.367) did not decrease significantly. This program is effective for increasing memory, visual-motor integration, and memory self-efficacy in older adults. Therefore, it can be used to improve cognition and prevent dementia in older adults. © 2017 Korean Society of Nursing Science

  17. Design strategies for irregularly adapting parallel applications

    International Nuclear Information System (INIS)

    Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang; Sing, Jaswinder Pal

    2000-01-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance of dynamically adapting computations. In this work, we examine two major classes of adaptive applications, under five competing programming methodologies and four leading parallel architectures. Results indicate that it is possible to achieve message-passing performance using shared-memory programming techniques by carefully following the same high level strategies. Adaptive applications have computational work loads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations of such adaptive applications are therefore a challenging task. This work examines the implementation of two typical adaptive applications, Dynamic Remeshing and N-Body, across various programming paradigms and architectural platforms. We compare several critical factors of the parallel code development, including performance, programmability, scalability, algorithmic development, and portability

  18. Parallelism in matrix computations

    CERN Document Server

    Gallopoulos, Efstratios; Sameh, Ahmed H

    2016-01-01

    This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...

  19. Massively parallel sparse matrix function calculations with NTPoly

    Science.gov (United States)

    Dawson, William; Nakajima, Takahito

    2018-04-01

    We present NTPoly, a massively parallel library for computing the functions of sparse, symmetric matrices. The theory of matrix functions is a well developed framework with a wide range of applications including differential equations, graph theory, and electronic structure calculations. One particularly important application area is diagonalization free methods in quantum chemistry. When the input and output of the matrix function are sparse, methods based on polynomial expansions can be used to compute matrix functions in linear time. We present a library based on these methods that can compute a variety of matrix functions. Distributed memory parallelization is based on a communication avoiding sparse matrix multiplication algorithm. OpenMP task parallellization is utilized to implement hybrid parallelization. We describe NTPoly's interface and show how it can be integrated with programs written in many different programming languages. We demonstrate the merits of NTPoly by performing large scale calculations on the K computer.

  20. A program for undergraduate research into the mechanisms of sensory coding and memory decay

    Energy Technology Data Exchange (ETDEWEB)

    Calin-Jageman, R J

    2010-09-28

    This is the final technical report for this DOE project, entitltled "A program for undergraduate research into the mechanisms of sensory coding and memory decay". The report summarizes progress on the three research aims: 1) to identify phyisological and genetic correlates of long-term habituation, 2) to understand mechanisms of olfactory coding, and 3) to foster a world-class undergraduate neuroscience program. Progress on the first aim has enabled comparison of learning-regulated transcripts across closely related learning paradigms and species, and results suggest that only a small core of transcripts serve truly general roles in long-term memory. Progress on the second aim has enabled testing of several mutant phenotypes for olfactory behaviors, and results show that responses are not fully consistent with the combinitoral coding hypothesis. Finally, 14 undergraduate students participated in this research, the neuroscience program attracted extramural funding, and we completed a successful summer program to enhance transitions for community-college students into 4-year colleges to persue STEM fields.

  1. Apollo guidance, navigation and control: Guidance system operations plans for manned LM earth orbital and lunar missions using Program COLOSSUS 3. Section 7: Erasable memory programs

    Science.gov (United States)

    Hamilton, M. H.

    1972-01-01

    Erasable-memory programs (EMPs) designed for the guidance computers used in the command (CMC) and lunar modules (LGC) are described. CMC programs are designated COLOSSUS 3, and the associated EMPs are identified by a three-digit number beginning with 5. LGC programs are designated LUMINARY 1E, and the associated EMPs are identified, with one exception, by a three-digit number beginning with 1. The exception is EMP 99. The EMPs vary in complexity from a simple flagbit setting to a long and intricate logical structure. They all, however, cause the computer to behave in a way not intended in the original design of the programs; they accomplish this off-nominal behavior by some alteration of erasable memory to interface with existing fixed-memory programs to effect a desired result.

  2. Rubus: A compiler for seamless and extensible parallelism.

    Directory of Open Access Journals (Sweden)

    Muhammad Adnan

    Full Text Available Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU, originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84

  3. Memory-based frame synchronizer. [for digital communication systems

    Science.gov (United States)

    Stattel, R. J.; Niswander, J. K. (Inventor)

    1981-01-01

    A frame synchronizer for use in digital communications systems wherein data formats can be easily and dynamically changed is described. The use of memory array elements provide increased flexibility in format selection and sync word selection in addition to real time reconfiguration ability. The frame synchronizer comprises a serial-to-parallel converter which converts a serial input data stream to a constantly changing parallel data output. This parallel data output is supplied to programmable sync word recognizers each consisting of a multiplexer and a random access memory (RAM). The multiplexer is connected to both the parallel data output and an address bus which may be connected to a microprocessor or computer for purposes of programming the sync word recognizer. The RAM is used as an associative memory or decorder and is programmed to identify a specific sync word. Additional programmable RAMs are used as counter decoders to define word bit length, frame word length, and paragraph frame length.

  4. Parallel Monte Carlo reactor neutronics

    International Nuclear Information System (INIS)

    Blomquist, R.N.; Brown, F.B.

    1994-01-01

    The issues affecting implementation of parallel algorithms for large-scale engineering Monte Carlo neutron transport simulations are discussed. For nuclear reactor calculations, these include load balancing, recoding effort, reproducibility, domain decomposition techniques, I/O minimization, and strategies for different parallel architectures. Two codes were parallelized and tested for performance. The architectures employed include SIMD, MIMD-distributed memory, and workstation network with uneven interactive load. Speedups linear with the number of nodes were achieved

  5. SOFTWARE FOR DESIGNING PARALLEL APPLICATIONS

    Directory of Open Access Journals (Sweden)

    M. K. Bouza

    2017-01-01

    Full Text Available The object of research is the tools to support the development of parallel programs in C/C ++. The methods and software which automates the process of designing parallel applications are proposed.

  6. Parallel hierarchical global illumination

    Energy Technology Data Exchange (ETDEWEB)

    Snell, Quinn O. [Iowa State Univ., Ames, IA (United States)

    1997-10-08

    Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.

  7. PKA Increases in the Olfactory Bulb Act as Unconditioned Stimuli and Provide Evidence for Parallel Memory Systems: Pairing Odor with Increased PKA Creates Intermediate- and Long-Term, but Not Short-Term, Memories

    Science.gov (United States)

    Grimes, Matthew T.; Harley, Carolyn W.; Darby-King, Andrea; McLean, John H.

    2012-01-01

    Neonatal odor-preference memory in rat pups is a well-defined associative mammalian memory model dependent on cAMP. Previous work from this laboratory demonstrates three phases of neonatal odor-preference memory: short-term (translation-independent), intermediate-term (translation-dependent), and long-term (transcription- and…

  8. CUBESIM, Hypercube and Denelcor Hep Parallel Computer Simulation

    International Nuclear Information System (INIS)

    Dunigan, T.H.

    1988-01-01

    1 - Description of program or function: CUBESIM is a set of subroutine libraries and programs for the simulation of message-passing parallel computers and shared-memory parallel computers. Subroutines are supplied to simulate the Intel hypercube and the Denelcor HEP parallel computers. The system permits a user to develop and test parallel programs written in C or FORTRAN on a single processor. The user may alter such hypercube parameters as message startup times, packet size, and the computation-to-communication ratio. The simulation generates a trace file that can be used for debugging, performance analysis, or graphical display. 2 - Method of solution: The CUBESIM simulator is linked with the user's parallel application routines to run as a single UNIX process. The simulator library provides a small operating system to perform process and message management. 3 - Restrictions on the complexity of the problem: Up to 128 processors can be simulated with a virtual memory limit of 6 million bytes. Up to 1000 processes can be simulated

  9. ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing

    Directory of Open Access Journals (Sweden)

    Kamatani Naoyuki

    2011-05-01

    Full Text Available Abstract Background Use of missing genotype imputations and haplotype reconstructions are valuable in genome-wide association studies (GWASs. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use in workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan and Han Chinese in Beijing, and Chinese in the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.

  10. Adolescent development, hypothalamic-pituitary-adrenal function, and programming of adult learning and memory.

    Science.gov (United States)

    McCormick, Cheryl M; Mathews, Iva Z

    2010-06-30

    Chronic exposure to stress is known to affect learning and memory in adults through the release of glucocorticoid hormones by the hypothalamic-pituitary-adrenal (HPA) axis. In adults, glucocorticoids alter synaptic structure and function in brain regions that express high levels of glucocorticoid receptors and that mediate goal-directed behaviour and learning and memory. In contrast to relatively transient effects of stress on cognitive function in adulthood, exposure to high levels of glucocorticoids in early life can produce enduring changes through substantial remodeling of the developing nervous system. Adolescence is another time of significant brain development and maturation of the HPA axis, thereby providing another opportunity for glucocorticoids to exert programming effects on neurocircuitry involved in learning and memory. These topics are reviewed, as is the emerging research evidence in rodent models highlighting that adolescence may be a period of increased vulnerability compared to adulthood in which exposure to high levels of glucocorticoids results in enduring changes in adult cognitive function. Copyright 2009 Elsevier Inc. All rights reserved.

  11. Working Memory Training in Children with Mild Intellectual Disability, Through Designed Computerized Program

    Directory of Open Access Journals (Sweden)

    Mona Delavarian

    2015-12-01

    Full Text Available Objectives: The aim of this research is designing a computerized program, in game format, for working memory training in mild intellectual disabled children. Methods: 24 students participated as test and control groups. The auditory and visual-spatial WM were assessed by primary test, which included computerized Wechsler numerical forward and backward sub- tests, and secondary tests, which contained three parts: dual visual-spatial test, auditory test, and a one-syllable word recalling test. Results: The results showed significant differnces between working memory capacity in the intellectually disabled children and normal ones (P-value<0.00001. After using the computerized working memory training, Visual-spatial WM, auditory WM, and speaking were improved in the trained group. The mentioned four tests showed significant differences between pre-test and post-test. The trained group showed more improvements in forward tasks. The trained participant’s processing speed increased with training. Discussion: According to the results, comprehensive human-computer interfaces and the aplication of computer in children training, especially in traing of intellectual disabled children with impairements in visual and auditory perceptions, could be more effective and vaulable.

  12. Execution Model of Three Parallel Languages: OpenMP, UPC and CAF

    Directory of Open Access Journals (Sweden)

    Ami Marowka

    2005-01-01

    Full Text Available The aim of this paper is to present a qualitative evaluation of three state-of-the-art parallel languages: OpenMP, Unified Parallel C (UPC and Co-Array Fortran (CAF. OpenMP and UPC are explicit parallel programming languages based on the ANSI standard. CAF is an implicit programming language. On the one hand, OpenMP designs for shared-memory architectures and extends the base-language by using compiler directives that annotate the original source-code. On the other hand, UPC and CAF designs for distribute-shared memory architectures and extends the base-language by new parallel constructs. We deconstruct each language into its basic components, show examples, make a detailed analysis, compare them, and finally draw some conclusions.

  13. Parallelization in Modern C++

    CERN Multimedia

    CERN. Geneva

    2016-01-01

    The traditionally used and well established parallel programming models OpenMP and MPI are both targeting lower level parallelism and are meant to be as language agnostic as possible. For a long time, those models were the only widely available portable options for developing parallel C++ applications beyond using plain threads. This has strongly limited the optimization capabilities of compilers, has inhibited extensibility and genericity, and has restricted the use of those models together with other, modern higher level abstractions introduced by the C++11 and C++14 standards. The recent revival of interest in the industry and wider community for the C++ language has also spurred a remarkable amount of standardization proposals and technical specifications being developed. Those efforts however have so far failed to build a vision on how to seamlessly integrate various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous many-task execution flows, continuation s...

  14. Flynn Effects on Sub-Factors of Episodic and Semantic Memory: Parallel Gains over Time and the Same Set of Determining Factors

    Science.gov (United States)

    Ronnlund, Michael; Nilsson, Lars-Goran.

    2009-01-01

    The study examined the extent to which time-related gains in cognitive performance, so-called Flynn effects, generalize across sub-factors of episodic memory (recall and recognition) and semantic memory (knowledge and fluency). We conducted time-sequential analyses of data drawn from the Betula prospective cohort study, involving four age-matched…

  15. Feasibility study of current pulse induced 2-bit/4-state multilevel programming in phase-change memory

    Science.gov (United States)

    Liu, Yan; Fan, Xi; Chen, Houpeng; Wang, Yueqing; Liu, Bo; Song, Zhitang; Feng, Songlin

    2017-08-01

    In this brief, multilevel data storage for phase-change memory (PCM) has attracted more attention in the memory market to implement high capacity memory system and reduce cost-per-bit. In this work, we present a universal programing method of SET stair-case current pulse in PCM cells, which can exploit the optimum programing scheme to achieve 2-bit/ 4state resistance-level with equal logarithm interval. SET stair-case waveform can be optimized by TCAD real time simulation to realize multilevel data storage efficiently in an arbitrary phase change material. Experimental results from 1 k-bit PCM test-chip have validated the proposed multilevel programing scheme. This multilevel programming scheme has improved the information storage density, robustness of resistance-level, energy efficient and avoiding process complexity.

  16. Parallelization of the AliRoot event reconstruction by performing a semi- automatic source-code transformation

    CERN Multimedia

    CERN. Geneva

    2012-01-01

    side bus or processor interconnections. Parallelism can only result in performance gain, if the memory usage is optimized, memory locality improved and the communication between threads is minimized. But the domain of concurrent programming has become a field for highly skilled experts, as the implementation of multithreading is difficult, error prone and labor intensive. A full re-implementation for parallel execution of existing offline frameworks, like AliRoot in ALICE, is thus unaffordable. An alternative method, is to use a semi-automatic source-to-source transformation for getting a simple parallel design, with almost no interference between threads. This reduces the need of rewriting the develop...

  17. Parallel Computing in SCALE

    International Nuclear Information System (INIS)

    DeHart, Mark D.; Williams, Mark L.; Bowman, Stephen M.

    2010-01-01

    The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement

  18. Petri Nets Based Modelling of Control Flow for Memory-Aid Interactive Programs in Telemedicine

    CERN Document Server

    Khoromskaia, V K

    2004-01-01

    Petri Nets (PN) based modelling of the control flow for the interactive memory assistance programs designed for personal pocket computers and having special requirements for robustness is considered. The proposed concept allows one to elaborate the programs which can give users a variety of possibilities for a day-time planning in the presence of environmental and time restrictions. First, a PN model for a known simple algorithm is constructed and analyzed using the corresponding state equations and incidence matrix. Then a PN graph for a complicated algorithm with overlapping actions and choice possibilities is designed, supplemented by an example of its analysis. Dynamic behaviour of this graph is tested by tracing of all possible paths of the flow of control using the PN simulator. It is shown that PN based modelling provides reliably predictable performance of interactive algorithms with branched structures and concurrency requirements.

  19. Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study

    Directory of Open Access Journals (Sweden)

    Hari Radhakrishnan

    2015-01-01

    Full Text Available This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.

  20. Parallel computations

    CERN Document Server

    1982-01-01

    Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn

  1. Badlands: A parallel basin and landscape dynamics model

    Directory of Open Access Journals (Sweden)

    T. Salles

    2016-01-01

    Full Text Available Over more than three decades, a number of numerical landscape evolution models (LEMs have been developed to study the combined effects of climate, sea-level, tectonics and sediments on Earth surface dynamics. Most of them are written in efficient programming languages, but often cannot be used on parallel architectures. Here, I present a LEM which ports a common core of accepted physical principles governing landscape evolution into a distributed memory parallel environment. Badlands (acronym for BAsin anD LANdscape DynamicS is an open-source, flexible, TIN-based landscape evolution model, built to simulate topography development at various space and time scales.

  2. Parallelization of a Monte Carlo particle transport simulation code

    Science.gov (United States)

    Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

    2010-05-01

    We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.

  3. Silicon Carbide Defect Qubits/Quantum Memory with Field-Tuning: OSD Quantum Science and Engineering Program (QSEP)

    Science.gov (United States)

    2017-08-01

    TECHNICAL REPORT 3073 August 2017 Silicon Carbide Defect Qubits/Quantum Memory with Field-tuning: OSD Quantum Science and Engineering Program...Quantum Science and Engineering Program) by the Advanced Concepts and Applied Research Branch (Code 71730), the Energy and Environmental Sustainability...the Secretary of Defense (OSD) Quantum Science and Engineering Program (QSEP). Their collaboration topic was to examine the effect of electric-field

  4. Programs Lucky and LuckyC - 3D parallel transport codes for the multi-group transport equation solution for XYZ geometry by Pm Sn method

    International Nuclear Information System (INIS)

    Moriakov, A.; Vasyukhno, V.; Netecha, M.; Khacheresov, G.

    2003-01-01

    Powerful supercomputers are available today. MBC-1000M is one of Russian supercomputers that may be used by distant way access. Programs LUCKY and LUCKY C were created to work for multi-processors systems. These programs have algorithms created especially for these computers and used MPI (message passing interface) service for exchanges between processors. LUCKY may resolved shielding tasks by multigroup discreet ordinate method. LUCKY C may resolve critical tasks by same method. Only XYZ orthogonal geometry is available. Under little space steps to approximate discreet operator this geometry may be used as universal one to describe complex geometrical structures. Cross section libraries are used up to P8 approximation by Legendre polynomials for nuclear data in GIT format. Programming language is Fortran-90. 'Vector' processors may be used that lets get a time profit up to 30 times. But unfortunately MBC-1000M has not these processors. Nevertheless sufficient value for efficiency of parallel calculations was obtained under 'space' (LUCKY) and 'space and energy' (LUCKY C ) paralleling. AUTOCAD program is used to control geometry after a treatment of input data. Programs have powerful geometry module, it is a beautiful tool to achieve any geometry. Output results may be processed by graphic programs on personal computer. (authors)

  5. Block-Parallel Data Analysis with DIY2

    Energy Technology Data Exchange (ETDEWEB)

    Morozov, Dmitriy [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Peterka, Tom [Argonne National Lab. (ANL), Argonne, IL (United States)

    2017-08-30

    DIY2 is a programming model and runtime for block-parallel analytics on distributed-memory machines. Its main abstraction is block-structured data parallelism: data are decomposed into blocks; blocks are assigned to processing elements (processes or threads); computation is described as iterations over these blocks, and communication between blocks is defined by reusable patterns. By expressing computation in this general form, the DIY2 runtime is free to optimize the movement of blocks between slow and fast memories (disk and flash vs. DRAM) and to concurrently execute blocks residing in memory with multiple threads. This enables the same program to execute in-core, out-of-core, serial, parallel, single-threaded, multithreaded, or combinations thereof. This paper describes the implementation of the main features of the DIY2 programming model and optimizations to improve performance. DIY2 is evaluated on benchmark test cases to establish baseline performance for several common patterns and on larger complete analysis codes running on large-scale HPC machines.

  6. Application of Pfortran and Co-Array Fortran in the Parallelization of the GROMOS96 Molecular Dynamics Module

    Directory of Open Access Journals (Sweden)

    Piotr Bała

    2001-01-01

    Full Text Available After at least a decade of parallel tool development, parallelization of scientific applications remains a significant undertaking. Typically parallelization is a specialized activity supported only partially by the programming tool set, with the programmer involved with parallel issues in addition to sequential ones. The details of concern range from algorithm design down to low-level data movement details. The aim of parallel programming tools is to automate the latter without sacrificing performance and portability, allowing the programmer to focus on algorithm specification and development. We present our use of two similar parallelization tools, Pfortran and Cray's Co-Array Fortran, in the parallelization of the GROMOS96 molecular dynamics module. Our parallelization started from the GROMOS96 distribution's shared-memory implementation of the replicated algorithm, but used little of that existing parallel structure. Consequently, our parallelization was close to starting with the sequential version. We found the intuitive extensions to Pfortran and Co-Array Fortran helpful in the rapid parallelization of the project. We present performance figures for both the Pfortran and Co-Array Fortran parallelizations showing linear speedup within the range expected by these parallelization methods.

  7. Parallel computing works

    Energy Technology Data Exchange (ETDEWEB)

    1991-10-23

    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  8. Parallel Simulation of Loosely Timed SystemC/TLM Programs: Challenges Raised by an Industrial Case Study

    Directory of Open Access Journals (Sweden)

    Denis Becker

    2016-05-01

    Full Text Available Transaction level models of systems-on-chip in SystemC are commonly used in the industry to provide an early simulation environment. The SystemC standard imposes coroutine semantics for the scheduling of simulated processes, to ensure determinism and reproducibility of simulations. However, because of this, sequential implementations have, for a long time, been the only option available, and still now the reference implementation is sequential. With the increasing size and complexity of models, and the multiplication of computation cores on recent machines, the parallelization of SystemC simulations is a major research concern. There have been several proposals for SystemC parallelization, but most of them are limited to cycle-accurate models. In this paper we focus on loosely timed models, which are commonly used in the industry. We present an industrial context and show that, unfortunately, most of the existing approaches for SystemC parallelization can fundamentally not apply in this context. We support this claim with a set of measurements performed on a platform used in production at STMicroelectronics. This paper surveys existing techniques, presents a visualization and profiling tool and identifies unsolved challenges in the parallelization of SystemC models at transaction level.

  9. Asymmetric Programming: A Highly Reliable Metadata Allocation Strategy for MLC NAND Flash Memory-Based Sensor Systems

    Science.gov (United States)

    Huang, Min; Liu, Zhaoqing; Qiao, Liyan

    2014-01-01

    While the NAND flash memory is widely used as the storage medium in modern sensor systems, the aggressive shrinking of process geometry and an increase in the number of bits stored in each memory cell will inevitably degrade the reliability of NAND flash memory. In particular, it's critical to enhance metadata reliability, which occupies only a small portion of the storage space, but maintains the critical information of the file system and the address translations of the storage system. Metadata damage will cause the system to crash or a large amount of data to be lost. This paper presents Asymmetric Programming, a highly reliable metadata allocation strategy for MLC NAND flash memory storage systems. Our technique exploits for the first time the property of the multi-page architecture of MLC NAND flash memory to improve the reliability of metadata. The basic idea is to keep metadata in most significant bit (MSB) pages which are more reliable than least significant bit (LSB) pages. Thus, we can achieve relatively low bit error rates for metadata. Based on this idea, we propose two strategies to optimize address mapping and garbage collection. We have implemented Asymmetric Programming on a real hardware platform. The experimental results show that Asymmetric Programming can achieve a reduction in the number of page errors of up to 99.05% with the baseline error correction scheme. PMID:25310473

  10. Asymmetric Programming: A Highly Reliable Metadata Allocation Strategy for MLC NAND Flash Memory-Based Sensor Systems

    Directory of Open Access Journals (Sweden)

    Min Huang

    2014-10-01

    Full Text Available While the NAND flash memory is widely used as the storage medium in modern sensor systems, the aggressive shrinking of process geometry and an increase in the number of bits stored in each memory cell will inevitably degrade the reliability of NAND flash memory. In particular, it’s critical to enhance metadata reliability, which occupies only a small portion of the storage space, but maintains the critical information of the file system and the address translations of the storage system. Metadata damage will cause the system to crash or a large amount of data to be lost. This paper presents Asymmetric Programming, a highly reliable metadata allocation strategy for MLC NAND flash memory storage systems. Our technique exploits for the first time the property of the multi-page architecture of MLC NAND flash memory to improve the reliability of metadata. The basic idea is to keep metadata in most significant bit (MSB pages which are more reliable than least significant bit (LSB pages. Thus, we can achieve relatively low bit error rates for metadata. Based on this idea, we propose two strategies to optimize address mapping and garbage collection. We have implemented Asymmetric Programming on a real hardware platform. The experimental results show that Asymmetric Programming can achieve a reduction in the number of page errors of up to 99.05% with the baseline error correction scheme.

  11. Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

    Science.gov (United States)

    Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

    2017-01-01

    Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.

  12. A low-voltage flash memory cell utilizing the gate-injection program/erase method with a recessed channel structure

    International Nuclear Information System (INIS)

    Wu Dake; Huang Ru; Wang Pengfei; Tang Poren; Wang Yangyuan

    2008-01-01

    In this paper, a low-voltage recessed channel SONOS flash memory using the gate-injection program/erase method is proposed and investigated for NAND application. It is shown that the proposed flash memory can achieve 8 V lower programming voltage compared with planar flash memory, due to the effective capacitance coupling and the electric-field enhancement by combining the recessed channel structure and the gate-injection program/erase method. In addition, more than 30% larger threshold voltage window and improved short channel effects can be obtained in the proposed flash memory

  13. Parallel R-matrix computation

    International Nuclear Information System (INIS)

    Heggarty, J.W.

    1999-06-01

    For almost thirty years, sequential R-matrix computation has been used by atomic physics research groups, from around the world, to model collision phenomena involving the scattering of electrons or positrons with atomic or molecular targets. As considerable progress has been made in the understanding of fundamental scattering processes, new data, obtained from more complex calculations, is of current interest to experimentalists. Performing such calculations, however, places considerable demands on the computational resources to be provided by the target machine, in terms of both processor speed and memory requirement. Indeed, in some instances the computational requirements are so great that the proposed R-matrix calculations are intractable, even when utilising contemporary classic supercomputers. Historically, increases in the computational requirements of R-matrix computation were accommodated by porting the problem codes to a more powerful classic supercomputer. Although this approach has been successful in the past, it is no longer considered to be a satisfactory solution due to the limitations of current (and future) Von Neumann machines. As a consequence, there has been considerable interest in the high performance multicomputers, that have emerged over the last decade which appear to offer the computational resources required by contemporary R-matrix research. Unfortunately, developing codes for these machines is not as simple a task as it was to develop codes for successive classic supercomputers. The difficulty arises from the considerable differences in the computing models that exist between the two types of machine and results in the programming of multicomputers to be widely acknowledged as a difficult, time consuming and error-prone task. Nevertheless, unless parallel R-matrix computation is realised, important theoretical and experimental atomic physics research will continue to be hindered. This thesis describes work that was undertaken in

  14. Memory-efficient dynamic programming backtrace and pairwise local sequence alignment.

    Science.gov (United States)

    Newberg, Lee A

    2008-08-15

    A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward-backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis. Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10,000. Sample C++-code for optimal backtrace is available in the Supplementary Materials. Supplementary data is available at Bioinformatics online.

  15. Parallel algorithms

    CERN Document Server

    Casanova, Henri; Robert, Yves

    2008-01-01

    ""…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi

  16. The geometric effect and programming current reduction in cylindrical-shaped phase change memory

    International Nuclear Information System (INIS)

    Li Yiming; Hwang, C-H; Li, T-Y; Cheng, H-W

    2009-01-01

    This study conducts a three-dimensional electro-thermal time-domain simulation for numerical analysis of cylindrical-shaped phase change memories (PCMs). The influence of chalcogenide material, germanium antimony telluride (GeSbTe or GST), structure on PCM operation is explored. GST with vertical structure exhibits promising characteristics. The bottom electrode contact (BEC) is advanced to improve the operation of PCMs, where a 25% reduction of the required programming current is achieved at a cost of 26% reduced resistance ratio. The position of the BEC is then shifted to further improve the performance of PCMs. The required programming current is reduced by a factor of 11, where the resistance ratio is only decreased by 6.9%. However, the PCMs with a larger shift of BEC are sensitive to process variation. To design PCMs with less than 10% programming current variation, PCMs with shifted BEC, where the shifted distance is equal to 1.5 times the BEC's radius, is worth considering. This study quantitatively estimates the structure effect on the phase transition of PCMs and physically provides an insight into the design and technology of PCMs.

  17. NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Seyong [ORNL; Vetter, Jeffrey S [ORNL

    2016-01-01

    Computer architecture experts expect that non-volatile memory (NVM) hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we present NVL-C: a novel programming system that facilitates the efficient and correct programming of NVM main memory systems. The NVL-C programming abstraction extends C with a small set of intuitive language features that target NVM main memory, and can be combined directly with traditional C memory model features for DRAM. We have designed these new features to enable compiler analyses and run-time checks that can improve performance and guard against a number of subtle programming errors, which, when left uncorrected, can corrupt NVM-stored data. Moreover, to enable recovery of data across application or system failures, these NVL-C features include a flexible directive for specifying NVM transactions. So that our implementation might be extended to other compiler front ends and languages, the majority of our compiler analyses are implemented in an extended version of LLVM's intermediate representation (LLVM IR). We evaluate NVL-C on a number of applications to show its flexibility, performance, and correctness.

  18. Embedding Topical Elements of Parallel Programming, Computer Graphics, and Artificial Intelligence across the Undergraduate CS Required Courses

    Directory of Open Access Journals (Sweden)

    James Wolfer

    2015-02-01

    Full Text Available Traditionally, topics such as parallel computing, computer graphics, and artificial intelligence have been taught as stand-alone courses in the computing curriculum. Often these are elective courses, limiting the material to the subset of students choosing to take the course. Recently there has been movement to distribute topics across the curriculum in order to ensure that all graduates have been exposed to concepts such as parallel computing. Previous work described an attempt to systematically weave a tapestry of topics into the undergraduate computing curriculum. This paper reviews that work and expands it with representative examples of assignments, demonstrations, and results as well as describing how the tools and examples deployed for these classes have a residual effect on classes such as Comptuer Literacy.

  19. Logical inference techniques for loop parallelization

    KAUST Repository

    Oancea, Cosmin E.; Rauchwerger, Lawrence

    2012-01-01

    This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop's memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = Ø, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = Ø). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECTCLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers. Copyright © 2012 ACM.

  20. Logical inference techniques for loop parallelization

    KAUST Repository

    Oancea, Cosmin E.

    2012-01-01

    This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop\\'s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = Ø, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = Ø). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECTCLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers. Copyright © 2012 ACM.

  1. A parallel solver for huge dense linear systems

    Science.gov (United States)

    Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

    2011-11-01

    HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system

  2. Algorithms for computational fluid dynamics n parallel processors

    International Nuclear Information System (INIS)

    Van de Velde, E.F.

    1986-01-01

    A study of parallel algorithms for the numerical solution of partial differential equations arising in computational fluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries

  3. Development of parallel benchmark code by sheet metal forming simulator 'ITAS'

    International Nuclear Information System (INIS)

    Watanabe, Hiroshi; Suzuki, Shintaro; Minami, Kazuo

    1999-03-01

    This report describes the development of parallel benchmark code by sheet metal forming simulator 'ITAS'. ITAS is a nonlinear elasto-plastic analysis program by the finite element method for the purpose of the simulation of sheet metal forming. ITAS adopts the dynamic analysis method that computes displacement of sheet metal at every time unit and utilizes the implicit method with the direct linear equation solver. Therefore the simulator is very robust. However, it requires a lot of computational time and memory capacity. In the development of the parallel benchmark code, we designed the code by MPI programming to reduce the computational time. In numerical experiments on the five kinds of parallel super computers at CCSE JAERI, i.e., SP2, SR2201, SX-4, T94 and VPP300, good performances are observed. The result will be shown to the public through WWW so that the benchmark results may become a guideline of research and development of the parallel program. (author)

  4. Massive parallel optical pattern recognition and retrieval via a two-stage high-capacity multichannel holographic random access memory system

    International Nuclear Information System (INIS)

    Cai, Luzhong; Liu, Hua-Kuang

    2000-01-01

    The multistage holographic optical random access memory (HORAM) system reported recently by Liu et al. provides a new degree of freedom for improving storage capacity. We further present a theoretical and practical analysis of the HORAM system with experimental results. Our discussions include the system design and geometrical requirements, its applications for multichannel pattern recognition and associative memory, the 2-D and 3-D information storage capacity, and multichannel image storage and retrieval via VanderLugt correlator (VLC) filters and joint transform holograms. A series of experiments are performed to demonstrate the feasibility of the multichannel pattern recognition and image retrieval with both the VLC and joint transform correlator (JTC) architectures. The experimental results with as many as 2025 channels show good agreement with the theoretical analysis. (c) 2000 Society of Photo-Optical Instrumentation Engineers

  5. Randomized controlled trial of a healthy brain ageing cognitive training program: effects on memory, mood, and sleep.

    Science.gov (United States)

    Diamond, Keri; Mowszowski, Loren; Cockayne, Nicole; Norrie, Louisa; Paradise, Matthew; Hermens, Daniel F; Lewis, Simon J G; Hickie, Ian B; Naismith, Sharon L

    2015-01-01

    With the rise in the ageing population and absence of a cure for dementia, cost-effective prevention strategies for those 'at risk' of dementia including those with depression and/or mild cognitive impairment are urgently required. This study evaluated the efficacy of a multifaceted Healthy Brain Ageing Cognitive Training (HBA-CT) program for older adults 'at risk' of dementia. Using a single-blinded design, 64 participants (mean age = 66.5 years, SD = 8.6) were randomized to an immediate treatment (HBA-CT) or treatment-as-usual control arm. The HBA-CT intervention was conducted twice-weekly for seven weeks and comprised group-based psychoeducation about cognitive strategies and modifiable lifestyle factors pertaining to healthy brain ageing, and computerized cognitive training. In comparison to the treatment-as-usual control arm, the HBA-CT program was associated with improvements in verbal memory (p = 0.03), self-reported memory (p = 0.03), mood (p = 0.01), and sleep (p = 0.01). While the improvements in memory (p = 0.03) and sleep (p = 0.02) remained after controlling for improvements in mood, only a trend in verbal memory improvement was apparent after controlling for sleep. The HBA-CT program improves cognitive, mood, and sleep functions in older adults 'at risk' of dementia, and therefore offers promise as a secondary prevention strategy.

  6. Parallelization of elliptic solver for solving 1D Boussinesq model

    Science.gov (United States)

    Tarwidi, D.; Adytia, D.

    2018-03-01

    In this paper, a parallel implementation of an elliptic solver in solving 1D Boussinesq model is presented. Numerical solution of Boussinesq model is obtained by implementing a staggered grid scheme to continuity, momentum, and elliptic equation of Boussinesq model. Tridiagonal system emerging from numerical scheme of elliptic equation is solved by cyclic reduction algorithm. The parallel implementation of cyclic reduction is executed on multicore processors with shared memory architectures using OpenMP. To measure the performance of parallel program, large number of grids is varied from 28 to 214. Two test cases of numerical experiment, i.e. propagation of solitary and standing wave, are proposed to evaluate the parallel program. The numerical results are verified with analytical solution of solitary and standing wave. The best speedup of solitary and standing wave test cases is about 2.07 with 214 of grids and 1.86 with 213 of grids, respectively, which are executed by using 8 threads. Moreover, the best efficiency of parallel program is 76.2% and 73.5% for solitary and standing wave test cases, respectively.

  7. PGHPF – An Optimizing High Performance Fortran Compiler for Distributed Memory Machines

    Directory of Open Access Journals (Sweden)

    Zeki Bozkus

    1997-01-01

    Full Text Available High Performance Fortran (HPF is the first widely supported, efficient, and portable parallel programming language for shared and distributed memory systems. HPF is realized through a set of directive-based extensions to Fortran 90. It enables application developers and Fortran end-users to write compact, portable, and efficient software that will compile and execute on workstations, shared memory servers, clusters, traditional supercomputers, or massively parallel processors. This article describes a production-quality HPF compiler for a set of parallel machines. Compilation techniques such as data and computation distribution, communication generation, run-time support, and optimization issues are elaborated as the basis for an HPF compiler implementation on distributed memory machines. The performance of this compiler on benchmark programs demonstrates that high efficiency can be achieved executing HPF code on parallel architectures.

  8. Parallelization of pressure equation solver for incompressible N-S equations

    International Nuclear Information System (INIS)

    Ichihara, Kiyoshi; Yokokawa, Mitsuo; Kaburaki, Hideo.

    1996-03-01

    A pressure equation solver in a code for 3-dimensional incompressible flow analysis has been parallelized by using red-black SOR method and PCG method on Fujitsu VPP500, a vector parallel computer with distributed memory. For the comparison of scalability, the solver using the red-black SOR method has been also parallelized on the Intel Paragon, a scalar parallel computer with a distributed memory. The scalability of the red-black SOR method on both VPP500 and Paragon was lost, when number of processor elements was increased. The reason of non-scalability on both systems is increasing communication time between processor elements. In addition, the parallelization by DO-loop division makes the vectorizing efficiency lower on VPP500. For an effective implementation on VPP500, a large scale problem which holds very long vectorized DO-loops in the parallel program should be solved. PCG method with red-black SOR method applied to incomplete LU factorization (red-black PCG) has more iteration steps than normal PCG method with forward and backward substitution, in spite of same number of the floating point operations in a DO-loop of incomplete LU factorization. The parallelized red-black PCG method has less merits than the parallelized red-black SOR method when the computational region has fewer grids, because the low vectorization efficiency is obtained in red-black PCG method. (author)

  9. A CFD Heterogeneous Parallel Solver Based on Collaborating CPU and GPU

    Science.gov (United States)

    Lai, Jianqi; Tian, Zhengyu; Li, Hua; Pan, Sha

    2018-03-01

    Since Graphic Processing Unit (GPU) has a strong ability of floating-point computation and memory bandwidth for data parallelism, it has been widely used in the areas of common computing such as molecular dynamics (MD), computational fluid dynamics (CFD) and so on. The emergence of compute unified device architecture (CUDA), which reduces the complexity of compiling program, brings the great opportunities to CFD. There are three different modes for parallel solution of NS equations: parallel solver based on CPU, parallel solver based on GPU and heterogeneous parallel solver based on collaborating CPU and GPU. As we can see, GPUs are relatively rich in compute capacity but poor in memory capacity and the CPUs do the opposite. We need to make full use of the GPUs and CPUs, so a CFD heterogeneous parallel solver based on collaborating CPU and GPU has been established. Three cases are presented to analyse the solver’s computational accuracy and heterogeneous parallel efficiency. The numerical results agree well with experiment results, which demonstrate that the heterogeneous parallel solver has high computational precision. The speedup on a single GPU is more than 40 for laminar flow, it decreases for turbulent flow, but it still can reach more than 20. What’s more, the speedup increases as the grid size becomes larger.

  10. Contribution to the algorithmic and efficient programming of new parallel architectures including accelerators for neutron physics and shielding computations

    International Nuclear Information System (INIS)

    Dubois, J.

    2011-01-01

    In science, simulation is a key process for research or validation. Modern computer technology allows faster numerical experiments, which are cheaper than real models. In the field of neutron simulation, the calculation of eigenvalues is one of the key challenges. The complexity of these problems is such that a lot of computing power may be necessary. The work of this thesis is first the evaluation of new computing hardware such as graphics card or massively multi-core chips, and their application to eigenvalue problems for neutron simulation. Then, in order to address the massive parallelism of supercomputers national, we also study the use of asynchronous hybrid methods for solving eigenvalue problems with this very high level of parallelism. Then we experiment the work of this research on several national supercomputers such as the Titane hybrid machine of the Computing Center, Research and Technology (CCRT), the Curie machine of the Very Large Computing Centre (TGCC), currently being installed, and the Hopper machine at the Lawrence Berkeley National Laboratory (LBNL). We also do our experiments on local workstations to illustrate the interest of this research in an everyday use with local computing resources. (author) [fr

  11. Parallelization of the FLAPW method

    International Nuclear Information System (INIS)

    Canning, A.; Mannstadt, W.; Freeman, A.J.

    1999-01-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about one hundred atoms due to a lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel computer

  12. Parallelization of the FLAPW method

    Science.gov (United States)

    Canning, A.; Mannstadt, W.; Freeman, A. J.

    2000-08-01

    The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.

  13. Program scheme using common source lines in channel stacked NAND flash memory with layer selection by multilevel operation

    Science.gov (United States)

    Kim, Do-Bin; Kwon, Dae Woong; Kim, Seunghyun; Lee, Sang-Ho; Park, Byung-Gook

    2018-02-01

    To obtain high channel boosting potential and reduce a program disturbance in channel stacked NAND flash memory with layer selection by multilevel (LSM) operation, a new program scheme using boosted common source line (CSL) is proposed. The proposed scheme can be achieved by applying proper bias to each layer through its own CSL. Technology computer-aided design (TCAD) simulations are performed to verify the validity of the new method in LSM. Through TCAD simulation, it is revealed that the program disturbance characteristics is effectively improved by the proposed scheme.

  14. Design and Programming for Cable-Driven Parallel Robots in the German Pavilion at the EXPO 2015

    Directory of Open Access Journals (Sweden)

    Philipp Tempel

    2015-08-01

    Full Text Available In the German Pavilion at the EXPO 2015, two large cable-driven parallel robots are flying over the heads of the visitors representing two bees flying over Germany and displaying everyday life in Germany. Each robot consists of a mobile platform and eight cables suspended by winches and follows a desired trajectory, which needs to be computed in advance taking technical limitations, safety considerations and visual aspects into account. In this paper, a path planning software is presented, which includes the design process from developing a robot design and workspace estimation via planning complex trajectories considering technical limitations through to exporting a complete show. For a test trajectory, simulation results are given, which display the relevant trajectories and cable force distributions.

  15. Preliminary Study on the Enhancement of Reconstruction Speed for Emission Computed Tomography Using Parallel Processing

    International Nuclear Information System (INIS)

    Park, Min Jae; Lee, Jae Sung; Kim, Soo Mee; Kang, Ji Yeon; Lee, Dong Soo; Park, Kwang Suk

    2009-01-01

    Conventional image reconstruction uses simplified physical models of projection. However, real physics, for example 3D reconstruction, takes too long time to process all the data in clinic and is unable in a common reconstruction machine because of the large memory for complex physical models. We suggest the realistic distributed memory model of fast-reconstruction using parallel processing on personal computers to enable large-scale technologies. The preliminary tests for the possibility on virtual machines and various performance test on commercial super computer, Tachyon were performed. Expectation maximization algorithm with common 2D projection and realistic 3D line of response were tested. Since the process time was getting slower (max 6 times) after a certain iteration, optimization for compiler was performed to maximize the efficiency of parallelization. Parallel processing of a program on multiple computers was available on Linux with MPICH and NFS. We verified that differences between parallel processed image and single processed image at the same iterations were under the significant digits of floating point number, about 6 bit. Double processors showed good efficiency (1.96 times) of parallel computing. Delay phenomenon was solved by vectorization method using SSE. Through the study, realistic parallel computing system in clinic was established to be able to reconstruct by plenty of memory using the realistic physical models which was impossible to simplify

  16. Parallel computing: numerics, applications, and trends

    National Research Council Canada - National Science Library

    Trobec, Roman; Vajteršic, Marián; Zinterhof, Peter

    2009-01-01

    ... and/or distributed systems. The contributions to this book are focused on topics most concerned in the trends of today's parallel computing. These range from parallel algorithmics, programming, tools, network computing to future parallel computing. Particular attention is paid to parallel numerics: linear algebra, differential equations, numerica...

  17. MEDUSA - An overset grid flow solver for network-based parallel computer systems

    Science.gov (United States)

    Smith, Merritt H.; Pallis, Jani M.

    1993-01-01

    Continuing improvement in processing speed has made it feasible to solve the Reynolds-Averaged Navier-Stokes equations for simple three-dimensional flows on advanced workstations. Combining multiple workstations into a network-based heterogeneous parallel computer allows the application of programming principles learned on MIMD (Multiple Instruction Multiple Data) distributed memory parallel computers to the solution of larger problems. An overset-grid flow solution code has been developed which uses a cluster of workstations as a network-based parallel computer. Inter-process communication is provided by the Parallel Virtual Machine (PVM) software. Solution speed equivalent to one-third of a Cray-YMP processor has been achieved from a cluster of nine commonly used engineering workstation processors. Load imbalance and communication overhead are the principal impediments to parallel efficiency in this application.

  18. Deadline-aware scheduling for Software Transactional Memory

    DEFF Research Database (Denmark)

    Maldonado, Walter; Marlier, Patrick; Felber, Pascal

    2011-01-01

    Software Transactional Memory (STM) is an optimistic concurrency control mechanism that simplifies the development of parallel programs. Still, the interest of STM has not yet been demonstrated for reactive applications that require bounded response time for some of their operations. We propose...

  19. An Investigation of Unified Memory Access Performance in CUDA

    Science.gov (United States)

    Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K.; Herbordt, Martin

    2015-01-01

    Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations. PMID:26594668

  20. On the Parallel Elliptic Single/Multigrid Solutions about Aligned and Nonaligned Bodies Using the Virtual Machine for Multiprocessors

    Directory of Open Access Journals (Sweden)

    A. Averbuch

    1994-01-01

    Full Text Available Parallel elliptic single/multigrid solutions around an aligned and nonaligned body are presented and implemented on two multi-user and single-user shared memory multiprocessors (Sequent Symmetry and MOS and on a distributed memory multiprocessor (a Transputer network. Our parallel implementation uses the Virtual Machine for Muli-Processors (VMMP, a software package that provides a coherent set of services for explicitly parallel application programs running on diverse multiple instruction multiple data (MIMD multiprocessors, both shared memory and message passing. VMMP is intended to simplify parallel program writing and to promote portable and efficient programming. Furthermore, it ensures high portability of application programs by implementing the same services on all target multiprocessors. The performance of our algorithm is investigated in detail. It is seen to fit well the above architectures when the number of processors is less than the maximal number of grid points along the axes. In general, the efficiency in the nonaligned case is higher than in the aligned case. Alignment overhead is observed to be up to 200% in the shared-memory case and up to 65% in the message-passing case. We have demonstrated that when using VMMP, the portability of the algorithms is straightforward and efficient.

  1. Main Memory

    OpenAIRE

    Boncz, Peter; Liu, Lei; Özsu, M.

    2008-01-01

    htmlabstractPrimary storage, presently known as main memory, is the largest memory directly accessible to the CPU in the prevalent Von Neumann model and stores both data and instructions (program code). The CPU continuously reads instructions stored there and executes them. It is also called Random Access Memory (RAM), to indicate that load/store instructions can access data at any location at the same cost, is usually implemented using DRAM chips, which are connected to the CPU and other per...

  2. An A.P.L. micro-programmed machine: implementation on a Multi-20 mini-computer, memory organization, micro-programming and flowcharts

    International Nuclear Information System (INIS)

    Granger, Jean-Louis

    1975-01-01

    This work deals with the presentation of an APL interpreter implemented on an MULTI 20 mini-computer. It includes a left to right syntax analyser, a recursive routine for generation and execution. This routine uses a beating method for array processing. Moreover, during the execution of all APL statements, dynamic memory allocation is used. Execution of basic operations has been micro-programmed. The basic APL interpreter has a length of 10 K bytes. It uses overlay methods. (author) [fr

  3. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    Energy Technology Data Exchange (ETDEWEB)

    Robert W. Numrich

    2008-04-22

    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a programto provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to

  4. A Novel Parallel Algorithm for Edit Distance Computation

    Directory of Open Access Journals (Sweden)

    Muhammad Murtaza Yousaf

    2018-01-01

    Full Text Available The edit distance between two sequences is the minimum number of weighted transformation-operations that are required to transform one string into the other. The weighted transformation-operations are insert, remove, and substitute. Dynamic programming solution to find edit distance exists but it becomes computationally intensive when the lengths of strings become very large. This work presents a novel parallel algorithm to solve edit distance problem of string matching. The algorithm is based on resolving dependencies in the dynamic programming solution of the problem and it is able to compute each row of edit distance table in parallel. In this way, it becomes possible to compute the complete table in min(m,n iterations for strings of size m and n whereas state-of-the-art parallel algorithm solves the problem in max(m,n iterations. The proposed algorithm also increases the amount of parallelism in each of its iteration. The algorithm is also capable of exploiting spatial locality while its implementation. Additionally, the algorithm works in a load balanced way that further improves its performance. The algorithm is implemented for multicore systems having shared memory. Implementation of the algorithm in OpenMP shows linear speedup and better execution time as compared to state-of-the-art parallel approach. Efficiency of the algorithm is also proven better in comparison to its competitor.

  5. Main Memory

    NARCIS (Netherlands)

    P.A. Boncz (Peter); L. Liu (Lei); M. Tamer Özsu

    2008-01-01

    htmlabstractPrimary storage, presently known as main memory, is the largest memory directly accessible to the CPU in the prevalent Von Neumann model and stores both data and instructions (program code). The CPU continuously reads instructions stored there and executes them. It is also called Random

  6. Barkley's Parent Training Program, Working Memory Training and their Combination for Children with ADHD: Attention Deficit Hyperactivity Disorder.

    Directory of Open Access Journals (Sweden)

    Zahra Hosainzadeh Maleki

    2014-06-01

    Full Text Available The aim of the current study was to examine the effectiveness of Barkley's parent training program, working memory training and the combination of these two interventions for children with Attention deficit hyperactivity disorder (ADHD.In this study, 36 participants with ADHD (aged 6 to 12 years were selected by convenience sampling. Revision of the Swanson, Nolan and Pelham (SNAP questionnaire (SNAP-IV, Child Behavior Checklist (CBCL and clinical interviews were employed to diagnose ADHD. Wechsler Intelligence Scale for Children-Fourth Edition was also implemented. The participants were randomly assigned to the three intervention groups of Barkley's parent training program, working memory training and the combined group. SNAP-IV and CBCL were used as pre-tests and post-tests across all three groups. Data were analyzed using MANCOVA (SPSS version18.There was a significant difference (p< 0.05 in the decline of attention deficit and hyperactivity /impulsivity symptoms between the combined treatment group and working memory training group and also between the combined treatment group and the parent training group in SNAP. In terms of attention problems (experience-based subscales of CBCL, there was a significant difference (p< 0.001 between the combined treatment group and working memory training group. Furthermore, compared to the working memory training and parent training groups, the combined group demonstrated a significant decline (p< 0.01 in clinical symptoms of ADHD (based on DSM.It was revealed that combined treatment in comparison with the other two methods suppressed the clinical symptoms of ADHD more significantly.

  7. Rubus: A compiler for seamless and extensible parallelism

    Science.gov (United States)

    Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor

    2017-01-01

    Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called Graphic Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, the traditional programming languages which were designed to work with machines having single core CPUs, cannot utilize the parallelism available on multi-core processors efficiently. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, to parallelize legacy code can require rewriting a significant portion of code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, on average a speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores. Whereas, for a matrix multiplication benchmark the average execution speedup of 84 times has been

  8. A multitransputer parallel processing system (MTPPS)

    International Nuclear Information System (INIS)

    Jethra, A.K.; Pande, S.S.; Borkar, S.P.; Khare, A.N.; Ghodgaonkar, M.D.; Bairi, B.R.

    1993-01-01

    This report describes the design and implementation of a 16 node Multi Transputer Parallel Processing System(MTPPS) which is a platform for parallel program development. It is a MIMD machine based on message passing paradigm. The basic compute engine is an Inmos Transputer Ims T800-20. Transputer with local memory constitutes the processing element (NODE) of this MIMD architecture. Multiple NODES can be connected to each other in an identifiable network topology through the high speed serial links of the transputer. A Network Configuration Unit (NCU) incorporates the necessary hardware to provide software controlled network configuration. System is modularly expandable and more NODES can be added to the system to achieve the required processing power. The system is backend to the IBM-PC which has been integrated into the system to provide user I/O interface. PC resources are available to the programmer. Interface hardware between the PC and the network of transputers is INMOS compatible. Therefore, all the commercially available development software compatible to INMOS products can run on this system. While giving the details of design and implementation, this report briefly summarises MIMD Architectures, Transputer Architecture and Parallel Processing Software Development issues. LINPACK performance evaluation of the system and solutions of neutron physics and plasma physics problem have been discussed along with results. (author). 12 refs., 22 figs., 3 tabs., 3 appendixes

  9. Parallel Algorithms for Graph Optimization using Tree Decompositions

    Energy Technology Data Exchange (ETDEWEB)

    Sullivan, Blair D [ORNL; Weerapurage, Dinesh P [ORNL; Groer, Christopher S [ORNL

    2012-06-01

    Although many $\\cal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.

  10. Effects of a School-Based Instrumental Music Program on Verbal and Visual Memory in Primary School Children: A Longitudinal Study

    OpenAIRE

    Roden, Ingo; Kreutz, Gunter; Bongard, Stephan

    2012-01-01

    This study examined the effects of a school-based instrumental training program on the development of verbal and visual memory skills in primary school children. Participants either took part in a music program with weekly 45 minutes sessions of instrumental lessons in small groups at school, or they received extended natural science training. A third group of children did not receive additional training. Each child completed verbal and visual memory tests for three times over a period of 18 ...

  11. Effects of a school-based instrumental music program on verbal and visual memory in primary school children: a longitudinal study

    Directory of Open Access Journals (Sweden)

    Ingo eRoden

    2012-12-01

    Full Text Available This study examined the effects of a school-based instrumental training program on the development of verbal and visual memory skills in primary school children. Participants either took part in a music program with weekly 45 minutes sessions of instrumental lessons in small groups at school, or they received extended natural science training. A third group of children did not receive additional training. Each child completed verbal and visual memory tests for three times over a period of 18 months. Significant Group by Time interactions were found in the measures of verbal memory. Children in the music group showed greater improvements than children in the control groups after controlling for children's socio-economic background, age and IQ. No differences between groups were found in the visual memory tests. These findings are consistent with and extend previous research by suggesting that children receiving music training may benefit from improvements in their verbal memory skills.

  12. Effects of a school-based instrumental music program on verbal and visual memory in primary school children: a longitudinal study.

    Science.gov (United States)

    Roden, Ingo; Kreutz, Gunter; Bongard, Stephan

    2012-01-01

    This study examined the effects of a school-based instrumental training program on the development of verbal and visual memory skills in primary school children. Participants either took part in a music program with weekly 45 min sessions of instrumental lessons in small groups at school, or they received extended natural science training. A third group of children did not receive additional training. Each child completed verbal and visual memory tests three times over a period of 18 months. Significant Group by Time interactions were found in the measures of verbal memory. Children in the music group showed greater improvements than children in the control groups after controlling for children's socio-economic background, age, and IQ. No differences between groups were found in the visual memory tests. These findings are consistent with and extend previous research by suggesting that children receiving music training may benefit from improvements in their verbal memory skills.

  13. Effects of a School-Based Instrumental Music Program on Verbal and Visual Memory in Primary School Children: A Longitudinal Study

    Science.gov (United States)

    Roden, Ingo; Kreutz, Gunter; Bongard, Stephan

    2012-01-01

    This study examined the effects of a school-based instrumental training program on the development of verbal and visual memory skills in primary school children. Participants either took part in a music program with weekly 45 min sessions of instrumental lessons in small groups at school, or they received extended natural science training. A third group of children did not receive additional training. Each child completed verbal and visual memory tests three times over a period of 18 months. Significant Group by Time interactions were found in the measures of verbal memory. Children in the music group showed greater improvements than children in the control groups after controlling for children’s socio-economic background, age, and IQ. No differences between groups were found in the visual memory tests. These findings are consistent with and extend previous research by suggesting that children receiving music training may benefit from improvements in their verbal memory skills. PMID:23267341

  14. Holographic associative memories in document retrieval systems

    International Nuclear Information System (INIS)

    Becker, P.J.; Bolle, H.; Keller, A.; Kistner, W.; Riecke, W.D.; Wagner, U.

    1979-03-01

    The objective of this work was the implementation of a holographic memory with associative readout for a document retrieval system. Taking advantage of the favourable properties of holography - associative readout of the memory, parallel processing in the response store - may give shorter response times than sequentially organized data memories. Such a system may also operate in the interactive mode including chain associations. In order to avoid technological difficulties, the experimental setup made use of commercially available components only. As a result an improved holographic structure is proposed which uses volume holograms in photorefractive crystals as storage device. In two chapters of appendix we give a review of the state of the art of electrooptic devices for coherent optical data processing and of competing technologies (semiconductor associative memories and associative program systems). (orig.) [de

  15. Study Trapped Charge Distribution in P-Channel Silicon-Oxide-Nitride-Oxide-Silicon Memory Device Using Dynamic Programming Scheme

    Science.gov (United States)

    Li, Fu-Hai; Chiu, Yung-Yueh; Lee, Yen-Hui; Chang, Ru-Wei; Yang, Bo-Jun; Sun, Wein-Town; Lee, Eric; Kuo, Chao-Wei; Shirota, Riichiro

    2013-04-01

    In this study, we precisely investigate the charge distribution in SiN layer by dynamic programming of channel hot hole induced hot electron injection (CHHIHE) in p-channel silicon-oxide-nitride-oxide-silicon (SONOS) memory device. In the dynamic programming scheme, gate voltage is increased as a staircase with fixed step amplitude, which can prohibits the injection of holes in SiN layer. Three-dimensional device simulation is calibrated and is compared with the measured programming characteristics. It is found, for the first time, that the hot electron injection point quickly traverses from drain to source side synchronizing to the expansion of charged area in SiN layer. As a result, the injected charges quickly spread over on the almost whole channel area uniformly during a short programming period, which will afford large tolerance against lateral trapped charge diffusion by baking.

  16. Comparisons of Energy Management Methods for a Parallel Plug-In Hybrid Electric Vehicle between the Convex Optimization and Dynamic Programming

    Directory of Open Access Journals (Sweden)

    Renxin Xiao

    2018-01-01

    Full Text Available This paper proposes a comparison study of energy management methods for a parallel plug-in hybrid electric vehicle (PHEV. Based on detailed analysis of the vehicle driveline, quadratic convex functions are presented to describe the nonlinear relationship between engine fuel-rate and battery charging power at different vehicle speed and driveline power demand. The engine-on power threshold is estimated by the simulated annealing (SA algorithm, and the battery power command is achieved by convex optimization with target of improving fuel economy, compared with the dynamic programming (DP based method and the charging depleting–charging sustaining (CD/CS method. In addition, the proposed control methods are discussed at different initial battery state of charge (SOC values to extend the application. Simulation results validate that the proposed strategy based on convex optimization can save the fuel consumption and reduce the computation burden obviously.

  17. System-Enforced Deterministic Streaming for Efficient Pipeline Parallelism

    Institute of Scientific and Technical Information of China (English)

    张昱; 李兆鹏; 曹慧芳

    2015-01-01

    Pipeline parallelism is a popular parallel programming pattern for emerging applications. However, program-ming pipelines directly on conventional multithreaded shared memory is difficult and error-prone. We present DStream, a C library that provides high-level abstractions of deterministic threads and streams for simply representing pipeline stage work-ers and their communications. The deterministic stream is established atop our proposed single-producer/multi-consumer (SPMC) virtual memory, which integrates synchronization with the virtual memory model to enforce determinism on shared memory accesses. We investigate various strategies on how to efficiently implement DStream atop the SPMC memory, so that an infinite sequence of data items can be asynchronously published (fixed) and asynchronously consumed in order among adjacent stage workers. We have successfully transformed two representative pipeline applications – ferret and dedup using DStream, and conclude conversion rules. An empirical evaluation shows that the converted ferret performed on par with its Pthreads and TBB counterparts in term of running time, while the converted dedup is close to 2.56X, 7.05X faster than the Pthreads counterpart and 1.06X, 3.9X faster than the TBB counterpart on 16 and 32 CPUs, respectively.

  18. Patterns for Parallel Software Design

    CERN Document Server

    Ortega-Arjona, Jorge Luis

    2010-01-01

    Essential reading to understand patterns for parallel programming Software patterns have revolutionized the way we think about how software is designed, built, and documented, and the design of parallel software requires you to consider other particular design aspects and special skills. From clusters to supercomputers, success heavily depends on the design skills of software developers. Patterns for Parallel Software Design presents a pattern-oriented software architecture approach to parallel software design. This approach is not a design method in the classic sense, but a new way of managin

  19. Pthreads vs MPI Parallel Performance of Angular-Domain Decomposed S

    International Nuclear Information System (INIS)

    Azmy, Y.Y.; Barnett, D.A.

    2000-01-01

    Two programming models for parallelizing the Angular Domain Decomposition (ADD) of the discrete ordinates (S n ) approximation of the neutron transport equation are examined. These are the shared memory model based on the POSIX threads (Pthreads) standard, and the message passing model based on the Message Passing Interface (MPI) standard. These standard libraries are available on most multiprocessor platforms thus making the resulting parallel codes widely portable. The question is: on a fixed platform, and for a particular code solving a given test problem, which of the two programming models delivers better parallel performance? Such comparison is possible on Symmetric Multi-Processors (SMP) architectures in which several CPUs physically share a common memory, and in addition are capable of emulating message passing functionality. Implementation of the two-dimensional,(S n ), Arbitrarily High Order Transport (AHOT) code for solving neutron transport problems using these two parallelization models is described. Measured parallel performance of each model on the COMPAQ AlphaServer 8400 and the SGI Origin 2000 platforms is described, and comparison of the observed speedup for the two programming models is reported. For the case presented in this paper it appears that the MPI implementation scales better than the Pthreads implementation on both platforms

  20. Distributed parallel messaging for multiprocessor systems

    Science.gov (United States)

    Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka

    2013-06-04

    A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.

  1. Efficient Parallel Kernel Solvers for Computational Fluid Dynamics Applications

    Science.gov (United States)

    Sun, Xian-He

    1997-01-01

    Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as Intel Paragon, IBM SP2, and Cray Origin2OO, have successfully delivered high performance computing power for solving some of the so-called "grand-challenge" problems. Despite initial success, parallel machines have not been widely accepted in production engineering environments due to the complexity of parallel programming. On a parallel computing system, a task has to be partitioned and distributed appropriately among processors to reduce communication cost and to attain load balance. More importantly, even with careful partitioning and mapping, the performance of an algorithm may still be unsatisfactory, since conventional sequential algorithms may be serial in nature and may not be implemented efficiently on parallel machines. In many cases, new algorithms have to be introduced to increase parallel performance. In order to achieve optimal performance, in addition to partitioning and mapping, a careful performance study should be conducted for a given application to find a good algorithm-machine combination. This process, however, is usually painful and elusive. The goal of this project is to design and develop efficient parallel algorithms for highly accurate Computational Fluid Dynamics (CFD) simulations and other engineering applications. The work plan is 1) developing highly accurate parallel numerical algorithms, 2) conduct preliminary testing to verify the effectiveness and potential of these algorithms, 3) incorporate newly developed algorithms into actual simulation packages. The work plan has well achieved. Two highly accurate, efficient Poisson solvers have been developed and tested based on two different approaches: (1) Adopting a mathematical geometry which has a better capacity to describe the fluid, (2) Using compact scheme to gain high order accuracy in numerical discretization. The previously developed Parallel Diagonal Dominant (PDD) algorithm

  2. Jordan's 2002 to 2012 Fertility Stall and Parallel USAID Investments in Family Planning: Lessons From an Assessment to Guide Future Programming.

    Science.gov (United States)

    Spindler, Esther; Bitar, Nisreen; Solo, Julie; Menstell, Elizabeth; Shattuck, Dominick

    2017-12-28

    Health practitioners, researchers, and donors are stumped about Jordan's stalled fertility rate, which has stagnated between 3.7 and 3.5 children per woman from 2002 to 2012, above the national replacement level of 2.1. This stall paralleled United States Agency for International Development (USAID) funding investments in family planning in Jordan, triggering an assessment of USAID family planning programming in Jordan. This article describes the methods, results, and implications of the programmatic assessment. Methods included an extensive desk review of USAID programs in Jordan and 69 interviews with reproductive health stakeholders. We explored reasons for fertility stagnation in Jordan's total fertility rate (TFR) and assessed the effects of USAID programming on family planning outcomes over the same time period. The assessment results suggest that the increased use of less effective methods, in particular withdrawal and condoms, are contributing to Jordan's TFR stall. Jordan's limited method mix, combined with strong sociocultural determinants around reproduction and fertility desires, have contributed to low contraceptive effectiveness in Jordan. Over the same time period, USAID contributions toward increasing family planning access and use, largely focused on service delivery programs, were extensive. Examples of effective initiatives, among others, include task shifting of IUD insertion services to midwives due to a shortage of female physicians. However, key challenges to improved use of family planning services include limited government investments in family planning programs, influential service provider behaviors and biases that limit informed counseling and choice, pervasive strong social norms of family size and fertility, and limited availability of different contraceptive methods. In contexts where sociocultural norms and a limited method mix are the dominant barriers toward improved family planning use, increased national government investments

  3. Low-voltage high-speed programming gate-all-around floating gate memory cell with tunnel barrier engineering

    Science.gov (United States)

    Hamzah, Afiq; Ezaila Alias, N.; Ismail, Razali

    2018-06-01

    The aim of this study is to investigate the memory performances of gate-all-around floating gate (GAA-FG) memory cell implementing engineered tunnel barrier concept of variable oxide thickness (VARIOT) of low-k/high-k for several high-k (i.e., Si3N4, Al2O3, HfO2, and ZrO2) with low-k SiO2 using three-dimensional (3D) simulator Silvaco ATLAS. The simulation work is conducted by initially determining the optimized thickness of low-k/high-k barrier-stacked and extracting their Fowler–Nordheim (FN) coefficients. Based on the optimized parameters the device performances of GAA-FG for fast program operation and data retention are assessed using benchmark set by 6 and 8 nm SiO2 tunnel layer respectively. The programming speed has been improved and wide memory window with 30% increment from conventional SiO2 has been obtained using SiO2/Al2O3 tunnel layer due to its thin low-k dielectric thickness. Furthermore, given its high band edges only 1% of charge-loss is expected after 10 years of ‑3.6/3.6 V gate stress.

  4. The parallel adult education system

    DEFF Research Database (Denmark)

    Wahlgren, Bjarne

    2015-01-01

    for competence development. The Danish university educational system includes two parallel programs: a traditional academic track (candidatus) and an alternative practice-based track (master). The practice-based program was established in 2001 and organized as part time. The total program takes half the time...

  5. Working memory training and poetry-based stimulation programs: are there differences in cognitive outcome in healthy older adults?

    Science.gov (United States)

    Zimmermann, Nicolle; Netto, Tania Maria; Amodeo, Maria Teresa; Ska, Bernadette; Fonseca, Rochele Paz

    2014-01-01

    Neuropsychological interventions have been mainly applied with clinical populations, in spite of the need of preventing negative changes across life span. Among the few studies of cognitive stimulation in elderly, surprisingly there is no enough research comparing direct and indirect active stimulation programs. This study aims to verify wheter there are differences between two cognitive interventions approaches in older adults: a structured Working Memory (WM) Training Program versus a Poetry-based Stimulation Program. Fourteen older adults were randomly assigned to participate into one of the two intervention groups. The assessed neurocognitive components were attention, episodic and working memory, communicative and executive functions. WM Training activities were based on Baddeley's model; Poetry-based Stimulation Program was composed by general language activities. Data were analyzed with one-way ANCOVA with Delta scores and pre and post-training tests raw scores. WM group improved performance on WM, inhibition, and cognitive flexibility measures, while Poetry group improved on verbal fluency and narrative discourse tasks. Both approaches presented benefits; however WM Training improved its target function with transfer effects to executive functions, being useful for future studies with a variety of dementias. Poetry-based Stimulation also improved complex linguistic abilities. Both approaches may be helpful as strategies to prevent dysfunctional aging changes.

  6. Cache-aware data structure model for parallelism and dynamic load balancing

    International Nuclear Information System (INIS)

    Sridi, Marwa

    2016-01-01

    This PhD thesis is dedicated to the implementation of innovative parallel methods in the framework of fast transient fluid-structure dynamics. It improves existing methods within EUROPLEXUS software, in order to optimize the shared memory parallel strategy, complementary to the original distributed memory approach, brought together into a global hybrid strategy for clusters of multi-core nodes. Starting from a sound analysis of the state of the art concerning data structuring techniques correlated to the hierarchic memory organization of current multi-processor architectures, the proposed work introduces an approach suitable for an explicit time integration (i.e. with no linear system to solve at each step). A data structure of type 'Structure of arrays' is conserved for the global data storage, providing flexibility and efficiency for current operations on kinematics fields (displacement, velocity and acceleration). On the contrary, in the particular case of elementary operations (for internal forces generic computations, as well as fluxes computations between cell faces for fluid models), particularly time consuming but localized in the program, a temporary data structure of type 'Array of structures' is used instead, to force an efficient filling of the cache memory and increase the performance of the resolution, for both serial and shared memory parallel processing. Switching from the global structure to the temporary one is based on a cell grouping strategy, following classing cache-blocking principles but handling specifically for this work neighboring data necessary to the efficient treatment of ALE fluxes for cells on the group boundaries. The proposed approach is extensively tested, from the point of views of both the computation time and the access failures into cache memory, confronting the gains obtained within the elementary operations to the potential overhead generated by the data structure switch. Obtained results are very satisfactory, especially

  7. Comparison of some parallelization strategies of thermalhydraulic codes on GPUs

    International Nuclear Information System (INIS)

    Jendoubi, T.; Bergeaud, V.; Geay, A.

    2013-01-01

    Modern supercomputers architecture is now often based on hybrid concepts combining parallelism to distributed memory, parallelism to shared memory and also to GPUs (Graphic Process Units). In this work, we propose a new approach to take advantage of these graphic cards in thermohydraulics algorithms. (authors)

  8. Parallel Lines

    Directory of Open Access Journals (Sweden)

    James G. Worner

    2017-05-01

    Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship.   ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.

  9. Mental models of adherence: parallels in perceptions, values, and expectations in adherence to prescribed home exercise programs and other personal regimens.

    Science.gov (United States)

    Rizzo, Jon; Bell, Alexandra

    2018-05-09

    A mental model is the collection of an individual's perceptions, values, and expectations about a particular aspect of their life, which strongly influences behaviors. This study explored orthopedic outpatients mental models of adherence to prescribed home exercise programs and how they related to mental models of adherence to other types of personal regimens. The study followed an interpretive description qualitative design. Data were collected via two semi-structured interviews. Interview One focused on participants prior experiences adhering to personal regimens. Interview Two focused on experiences adhering to their current prescribed home exercise program. Data analysis followed a constant comparative method. Findings revealed similarity in perceptions, values, and expectations that informed individuals mental models of adherence to personal regimens and prescribed home exercise programs. Perceived realized results, expected results, perceived social supports, and value of convenience characterized mental models of adherence. Parallels between mental models of adherence for prescribed home exercise and other personal regimens suggest that patients adherence behavior to prescribed routines may be influenced by adherence experiences in other aspects of their lives. By gaining insight into patients adherence experiences, values, and expectations across life domains, clinicians may tailor supports that enhance home exercise adherence. Implications for Rehabilitation A mental model is the collection of an individual's perceptions, values, and expectations about a particular aspect of their life, which is based on prior experiences and strongly influences behaviors. This study demonstrated similarity in orthopedic outpatients mental models of adherence to prescribed home exercise programs and adherence to personal regimens in other aspects of their lives. Physical therapists should inquire about patients non-medical adherence experiences, as strategies patients

  10. Routing performance analysis and optimization within a massively parallel computer

    Science.gov (United States)

    Archer, Charles Jens; Peters, Amanda; Pinnow, Kurt Walter; Swartz, Brent Allen

    2013-04-16

    An apparatus, program product and method optimize the operation of a massively parallel computer system by, in part, receiving actual performance data concerning an application executed by the plurality of interconnected nodes, and analyzing the actual performance data to identify an actual performance pattern. A desired performance pattern may be determined for the application, and an algorithm may be selected from among a plurality of algorithms stored within a memory, the algorithm being configured to achieve the desired performance pattern based on the actual performance data.

  11. Particle orbit tracking on a parallel computer: Hypertrack

    International Nuclear Information System (INIS)

    Cole, B.; Bourianoff, G.; Pilat, F.; Talman, R.

    1991-05-01

    A program has been written which performs particle orbit tracking on the Intel iPSC/860 distributed memory parallel computer. The tracking is performed using a thin element approach. A brief description of the structure and performance of the code is presented, along with applications of the code to the analysis of accelerator lattices for the SSC. The concept of ''ensemble tracking'', i.e. the tracking of ensemble averages of noninteracting particles, such as the emittance, is presented. Preliminary results of such studies will be presented. 2 refs., 6 figs

  12. Internal combustion engine control for series hybrid electric vehicles by parallel and distributed genetic programming/multiobjective genetic algorithms

    Science.gov (United States)

    Gladwin, D.; Stewart, P.; Stewart, J.

    2011-02-01

    This article addresses the problem of maintaining a stable rectified DC output from the three-phase AC generator in a series-hybrid vehicle powertrain. The series-hybrid prime power source generally comprises an internal combustion (IC) engine driving a three-phase permanent magnet generator whose output is rectified to DC. A recent development has been to control the engine/generator combination by an electronically actuated throttle. This system can be represented as a nonlinear system with significant time delay. Previously, voltage control of the generator output has been achieved by model predictive methods such as the Smith Predictor. These methods rely on the incorporation of an accurate system model and time delay into the control algorithm, with a consequent increase in computational complexity in the real-time controller, and as a necessity relies to some extent on the accuracy of the models. Two complementary performance objectives exist for the control system. Firstly, to maintain the IC engine at its optimal operating point, and secondly, to supply a stable DC supply to the traction drive inverters. Achievement of these goals minimises the transient energy storage requirements at the DC link, with a consequent reduction in both weight and cost. These objectives imply constant velocity operation of the IC engine under external load disturbances and changes in both operating conditions and vehicle speed set-points. In order to achieve these objectives, and reduce the complexity of implementation, in this article a controller is designed by the use of Genetic Programming methods in the Simulink modelling environment, with the aim of obtaining a relatively simple controller for the time-delay system which does not rely on the implementation of real time system models or time delay approximations in the controller. A methodology is presented to utilise the miriad of existing control blocks in the Simulink libraries to automatically evolve optimal control

  13. A data base for on-line event analysis on a distributed memory machine

    International Nuclear Information System (INIS)

    Argante, E.; Meesters, M.R.J.; Willers, I.; Stok, P. van der

    1996-01-01

    Parallel in-memory databases can enhance the structuring and parallelization of programs used in High Energy Physics (HEP). Efficient database access routines are used as communication primitives which hide the communication topology in contrast to the more explicit communications like PVM or MPI. A parallel in-memory database, called SPIDER, has been implemented on a 32 node Meiko CS-2 distributed memory machine. The SPIDER primitives generate a lower overhead than the one generated by PVM or MPI. The even reconstruction program, CPREAD, of the CLEAR experiment, has been used as test case. Performance measurements showed that CPREAD interfaced to SPIDER can easily cope with the event rate generated by CPLEAR. (author)

  14. An efficient parallel algorithm: Poststack and prestack Kirchhoff 3D depth migration using flexi-depth iterations

    Science.gov (United States)

    Rastogi, Richa; Srivastava, Abhishek; Khonde, Kiran; Sirasala, Kirannmayi M.; Londhe, Ashutosh; Chavhan, Hitesh

    2015-07-01

    This paper presents an efficient parallel 3D Kirchhoff depth migration algorithm suitable for current class of multicore architecture. The fundamental Kirchhoff depth migration algorithm exhibits inherent parallelism however, when it comes to 3D data migration, as the data size increases the resource requirement of the algorithm also increases. This challenges its practical implementation even on current generation high performance computing systems. Therefore a smart parallelization approach is essential to handle 3D data for migration. The most compute intensive part of Kirchhoff depth migration algorithm is the calculation of traveltime tables due to its resource requirements such as memory/storage and I/O. In the current research work, we target this area and develop a competent parallel algorithm for post and prestack 3D Kirchhoff depth migration, using hybrid MPI+OpenMP programming techniques. We introduce a concept of flexi-depth iterations while depth migrating data in parallel imaging space, using optimized traveltime table computations. This concept provides flexibility to the algorithm by migrating data in a number of depth iterations, which depends upon the available node memory and the size of data to be migrated during runtime. Furthermore, it minimizes the requirements of storage, I/O and inter-node communication, thus making it advantageous over the conventional parallelization approaches. The developed parallel algorithm is demonstrated and analysed on Yuva II, a PARAM series of supercomputers. Optimization, performance and scalability experiment results along with the migration outcome show the effectiveness of the parallel algorithm.

  15. Parallelization of applications for networks with homogeneous and heterogeneous processors

    International Nuclear Information System (INIS)

    Colombet, L.

    1994-01-01

    The aim of this thesis is to study and develop efficient methods for parallelization of scientific applications on parallel computers with distributed memory. The first part presents two libraries of PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) communication tools. They allow implementation of programs on most parallel machines, but also on heterogeneous computer networks. This chapter illustrates the problems faced when trying to evaluate performances of networks with heterogeneous processors. To evaluate such performances, the concepts of speed-up and efficiency have been modified and adapted to account for heterogeneity. The second part deals with a study of parallel application libraries such as ScaLAPACK and with the development of communication masking techniques. The general concept is based on communication anticipation, in particular by pipelining message sending operations. Experimental results on Cray T3D and IBM SP1 machines validates the theoretical studies performed on basic algorithms of the libraries discussed above. Two examples of scientific applications are given: the first is a model of young stars for astrophysics and the other is a model of photon trajectories in the Compton effect. (J.S.). 83 refs., 65 figs., 24 tabs

  16. Programming voltage reduction in phase change memory cells with tungsten trioxide bottom heating layer/electrode

    International Nuclear Information System (INIS)

    Rao Feng; Song Zhitang; Gong Yuefeng; Wu Liangcai; Feng Songlin; Chen, Bomy

    2008-01-01

    A phase change memory cell with tungsten trioxide bottom heating layer/electrode is investigated. The crystalline tungsten trioxide heating layer promotes the temperature rise in the Ge 2 Sb 2 Te 5 layer which causes the reduction in the reset voltage compared to a conventional phase change memory cell. Theoretical thermal simulation and calculation for the reset process are applied to understand the thermal effect of the tungsten trioxide heating layer/electrode. The improvement in thermal efficiency of the PCM cell mainly originates from the low thermal conductivity of the crystalline tungsten trioxide material.

  17. Both the caspase CSP-1 and a caspase-independent pathway promote programmed cell death in parallel to the canonical pathway for apoptosis in Caenorhabditis elegans.

    Directory of Open Access Journals (Sweden)

    Daniel P Denning

    Full Text Available Caspases are cysteine proteases that can drive apoptosis in metazoans and have critical functions in the elimination of cells during development, the maintenance of tissue homeostasis, and responses to cellular damage. Although a growing body of research suggests that programmed cell death can occur in the absence of caspases, mammalian studies of caspase-independent apoptosis are confounded by the existence of at least seven caspase homologs that can function redundantly to promote cell death. Caspase-independent programmed cell death is also thought to occur in the invertebrate nematode Caenorhabditis elegans. The C. elegans genome contains four caspase genes (ced-3, csp-1, csp-2, and csp-3, of which only ced-3 has been demonstrated to promote apoptosis. Here, we show that CSP-1 is a pro-apoptotic caspase that promotes programmed cell death in a subset of cells fated to die during C. elegans embryogenesis. csp-1 is expressed robustly in late pachytene nuclei of the germline and is required maternally for its role in embryonic programmed cell deaths. Unlike CED-3, CSP-1 is not regulated by the APAF-1 homolog CED-4 or the BCL-2 homolog CED-9, revealing that csp-1 functions independently of the canonical genetic pathway for apoptosis. Previously we demonstrated that embryos lacking all four caspases can eliminate cells through an extrusion mechanism and that these cells are apoptotic. Extruded cells differ from cells that normally undergo programmed cell death not only by being extruded but also by not being engulfed by neighboring cells. In this study, we identify in csp-3; csp-1; csp-2 ced-3 quadruple mutants apoptotic cell corpses that fully resemble wild-type cell corpses: these caspase-deficient cell corpses are morphologically apoptotic, are not extruded, and are internalized by engulfing cells. We conclude that both caspase-dependent and caspase-independent pathways promote apoptotic programmed cell death and the phagocytosis of cell

  18. Parallelization of a Quantum-Classic Hybrid Model For Nanoscale Semiconductor Devices

    Directory of Open Access Journals (Sweden)

    Oscar Salas

    2011-07-01

    Full Text Available The expensive reengineering of the sequential software and the difficult parallel programming are two of the many technical and economic obstacles to the wide use of HPC. We investigate the chance to improve in a rapid way the performance of a numerical serial code for the simulation of the transport of a charged carriers in a Double-Gate MOSFET. We introduce the Drift-Diffusion-Schrödinger-Poisson (DDSP model and we study a rapid parallelization strategy of the numerical procedure on shared memory architectures.

  19. Evidence of Filamentary Switching in Oxide-based Memory Devices via Weak Programming and Retention Failure Analysis

    Science.gov (United States)

    Younis, Adnan; Chu, Dewei; Li, Sean

    2015-09-01

    Further progress in high-performance microelectronic devices relies on the development of novel materials and device architectures. However, the components and designs that are currently in use have reached their physical limits. Intensive research efforts, ranging from device fabrication to performance evaluation, are required to surmount these limitations. In this paper, we demonstrate that the superior bipolar resistive switching characteristics of a CeO2:Gd-based memory device can be manipulated by means of UV radiation, serving as a new degree of freedom. Furthermore, the metal oxide-based (CeO2:Gd) memory device was found to possess electrical and neuromorphic multifunctionalities. To investigate the underlying switching mechanism of the device, its plasticity behaviour was studied by imposing weak programming conditions. In addition, a short-term to long-term memory transition analogous to the forgetting process in the human brain, which is regarded as a key biological synaptic function for information processing and data storage, was realized. Based on a careful examination of the device’s retention behaviour at elevated temperatures, the filamentary nature of switching in such devices can be understood from a new perspective.

  20. Parallel thermal radiation transport in two dimensions

    International Nuclear Information System (INIS)

    Smedley-Stevenson, R.P.; Ball, S.R.

    2003-01-01

    This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)

  1. Parallel thermal radiation transport in two dimensions

    Energy Technology Data Exchange (ETDEWEB)

    Smedley-Stevenson, R.P.; Ball, S.R. [AWE Aldermaston (United Kingdom)

    2003-07-01

    This paper describes the distributed memory parallel implementation of a deterministic thermal radiation transport algorithm in a 2-dimensional ALE hydrodynamics code. The parallel algorithm consists of a variety of components which are combined in order to produce a state of the art computational capability, capable of solving large thermal radiation transport problems using Blue-Oak, the 3 Tera-Flop MPP (massive parallel processors) computing facility at AWE (United Kingdom). Particular aspects of the parallel algorithm are described together with examples of the performance on some challenging applications. (author)

  2. Protein Phosphatase 1-Dependent Transcriptional Programs for Long-Term Memory and Plasticity

    Science.gov (United States)

    Graff, Johannes; Koshibu, Kyoko; Jouvenceau, Anne; Dutar, Patrick; Mansuy, Isabelle M.

    2010-01-01

    Gene transcription is essential for the establishment and the maintenance of long-term memory (LTM) and for long-lasting forms of synaptic plasticity. The molecular mechanisms that control gene transcription in neuronal cells are complex and recruit multiple signaling pathways in the cytoplasm and the nucleus. Protein kinases (PKs) and…

  3. Effects of Early Serotonin Programming on Fear Response, Memory and Aggression

    Science.gov (United States)

    The neurotransmitter serotonin (5-HT) also acts as a neurogenic compound in the developing brain. Early administration of a 5-HT agonist could alter development of serotonergic circuitry, altering behaviors mediated by 5-HT signaling, including memory, fear and aggression. The present study was desi...

  4. A novel two-level dynamic parallel data scheme for large 3-D SN calculations

    International Nuclear Information System (INIS)

    Sjoden, G.E.; Shedlock, D.; Haghighat, A.; Yi, C.

    2005-01-01

    We introduce a new dynamic parallel memory optimization scheme for executing large scale 3-D discrete ordinates (Sn) simulations on distributed memory parallel computers. In order for parallel transport codes to be truly scalable, they must use parallel data storage, where only the variables that are locally computed are locally stored. Even with parallel data storage for the angular variables, cumulative storage requirements for large discrete ordinates calculations can be prohibitive. To address this problem, Memory Tuning has been implemented into the PENTRAN 3-D parallel discrete ordinates code as an optimized, two-level ('large' array, 'small' array) parallel data storage scheme. Memory Tuning can be described as the process of parallel data memory optimization. Memory Tuning dynamically minimizes the amount of required parallel data in allocated memory on each processor using a statistical sampling algorithm. This algorithm is based on the integral average and standard deviation of the number of fine meshes contained in each coarse mesh in the global problem. Because PENTRAN only stores the locally computed problem phase space, optimal two-level memory assignments can be unique on each node, depending upon the parallel decomposition used (hybrid combinations of angular, energy, or spatial). As demonstrated in the two large discrete ordinates models presented (a storage cask and an OECD MOX Benchmark), Memory Tuning can save a substantial amount of memory per parallel processor, allowing one to accomplish very large scale Sn computations. (authors)

  5. Collectively loading an application in a parallel computer

    Science.gov (United States)

    Aho, Michael E.; Attinella, John E.; Gooding, Thomas M.; Miller, Samuel J.; Mundy, Michael B.

    2016-01-05

    Collectively loading an application in a parallel computer, the parallel computer comprising a plurality of compute nodes, including: identifying, by a parallel computer control system, a subset of compute nodes in the parallel computer to execute a job; selecting, by the parallel computer control system, one of the subset of compute nodes in the parallel computer as a job leader compute node; retrieving, by the job leader compute node from computer memory, an application for executing the job; and broadcasting, by the job leader to the subset of compute nodes in the parallel computer, the application for executing the job.

  6. Data driven parallelism in experimental high energy physics applications

    International Nuclear Information System (INIS)

    Pohl, M.

    1987-01-01

    I present global design principles for the implementation of high energy physics data analysis code on sequential and parallel processors with mixed shared and local memory. Potential parallelism in the structure of high energy physics tasks is identified with granularity varying from a few times 10 8 instructions all the way down to a few times 10 4 instructions. It follows the hierarchical structure of detector and data acquisition systems. To take advantage of this - yet preserving the necessary portability of the code - I propose a computational model with purely data driven concurrency in Single Program Multiple Data (SPMD) mode. The task granularity is defined by varying the granularity of the central data structure manipulated. Concurrent processes coordiate themselves asynchroneously using simple lock constructs on parts of the data structure. Load balancing among processes occurs naturally. The scheme allows to map the internal layout of the data structure closely onto the layout of local and shared memory in a parallel architecture. It thus allows to optimize the application with respect to synchronization as well as data transport overheads. I present a coarse top level design for a portable implementation of this scheme on sequential machines, multiprocessor mainframes (e.g. IBM 3090), tightly coupled multiprocessors (e.g. RP-3) and loosely coupled processor arrays (e.g. LCAP, Emulating Processor Farms). (orig.)

  7. Data driven parallelism in experimental high energy physics applications

    Science.gov (United States)

    Pohl, Martin

    1987-08-01

    I present global design principles for the implementation of High Energy Physics data analysis code on sequential and parallel processors with mixed shared and local memory. Potential parallelism in the structure of High Energy Physics tasks is identified with granularity varying from a few times 10 8 instructions all the way down to a few times 10 4 instructions. It follows the hierarchical structure of detector and data acquisition systems. To take advantage of this - yet preserving the necessary portability of the code - I propose a computational model with purely data driven concurrency in Single Program Multiple Data (SPMD) mode. The Task granularity is defined by varying the granularity of the central data structure manipulated. Concurrent processes coordinate themselves asynchroneously using simple lock constructs on parts of the data structure. Load balancing among processes occurs naturally. The scheme allows to map the internal layout of the data structure closely onto the layout of local and shared memory in a parallel architecture. It thus allows to optimize the application with respect to synchronization as well as data transport overheads. I present a coarse top level design for a portable implementation of this scheme on sequential machines, multiprocessor mainframes (e.g. IBM 3090), tightly coupled multiprocessors (e.g. RP-3) and loosely coupled processor arrays (e.g. LCAP, Emulating Processor Farms).

  8. Parallelization of MRCI based on hole-particle symmetry.

    Science.gov (United States)

    Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

    2005-01-15

    The parallel implementation of multireference configuration interaction program based on the hole-particle symmetry is described. The platform to implement the parallelization is an Intel-Architectural cluster consisting of 12 nodes, each of which is equipped with two 2.4-G XEON processors, 3-GB memory, and 36-GB disk, and are connected by a Gigabit Ethernet Switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 x 10(8) CSFs.

  9. Fast parallel event reconstruction

    CERN Multimedia

    CERN. Geneva

    2010-01-01

    On-line processing of large data volumes produced in modern HEP experiments requires using maximum capabilities of modern and future many-core CPU and GPU architectures.One of such powerful feature is a SIMD instruction set, which allows packing several data items in one register and to operate on all of them, thus achievingmore operations per clock cycle. Motivated by the idea of using the SIMD unit ofmodern processors, the KF based track fit has been adapted for parallelism, including memory optimization, numerical analysis, vectorization with inline operator overloading, and optimization using SDKs. The speed of the algorithm has been increased in 120000 times with 0.1 ms/track, running in parallel on 16 SPEs of a Cell Blade computer.  Running on a Nehalem CPU with 8 cores it shows the processing speed of 52 ns/track using the Intel Threading Building Blocks. The same KF algorithm running on an Nvidia GTX 280 in the CUDA frameworkprovi...

  10. An Introduction to Parallel Computation R

    Indian Academy of Sciences (India)

    How are they programmed? This article provides an introduction. A parallel computer is a network of processors built for ... and have been used to solve problems much faster than a single ... in parallel computer design is to select an organization which ..... The most ambitious approach to parallel computing is to develop.

  11. Exploiting Symmetry on Parallel Architectures.

    Science.gov (United States)

    Stiller, Lewis Benjamin

    1995-01-01

    This thesis describes techniques for the design of parallel programs that solve well-structured problems with inherent symmetry. Part I demonstrates the reduction of such problems to generalized matrix multiplication by a group-equivariant matrix. Fast techniques for this multiplication are described, including factorization, orbit decomposition, and Fourier transforms over finite groups. Our algorithms entail interaction between two symmetry groups: one arising at the software level from the problem's symmetry and the other arising at the hardware level from the processors' communication network. Part II illustrates the applicability of our symmetry -exploitation techniques by presenting a series of case studies of the design and implementation of parallel programs. First, a parallel program that solves chess endgames by factorization of an associated dihedral group-equivariant matrix is described. This code runs faster than previous serial programs, and discovered it a number of results. Second, parallel algorithms for Fourier transforms for finite groups are developed, and preliminary parallel implementations for group transforms of dihedral and of symmetric groups are described. Applications in learning, vision, pattern recognition, and statistics are proposed. Third, parallel implementations solving several computational science problems are described, including the direct n-body problem, convolutions arising from molecular biology, and some communication primitives such as broadcast and reduce. Some of our implementations ran orders of magnitude faster than previous techniques, and were used in the investigation of various physical phenomena.

  12. 7th International Workshop on Parallel Tools for High Performance Computing

    CERN Document Server

    Gracia, José; Nagel, Wolfgang; Resch, Michael

    2014-01-01

    Current advances in High Performance Computing (HPC) increasingly impact efficient software development workflows. Programmers for HPC applications need to consider trends such as increased core counts, multiple levels of parallelism, reduced memory per core, and I/O system challenges in order to derive well performing and highly scalable codes. At the same time, the increasing complexity adds further sources of program defects. While novel programming paradigms and advanced system libraries provide solutions for some of these challenges, appropriate supporting tools are indispensable. Such tools aid application developers in debugging, performance analysis, or code optimization and therefore make a major contribution to the development of robust and efficient parallel software. This book introduces a selection of the tools presented and discussed at the 7th International Parallel Tools Workshop, held in Dresden, Germany, September 3-4, 2013.  

  13. EDDY RESOLVING NUTRIENT ECODYNAMICS IN THE GLOBAL PARALLEL OCEAN PROGRAM AND CONNECTIONS WITH TRACE GASES IN THE SULFUR, HALOGEN AND NMHC CYCLES

    Energy Technology Data Exchange (ETDEWEB)

    S. CHU; S. ELLIOTT

    2000-08-01

    Ecodynamics and the sea-air transfer of climate relevant trace gases are intimately coupled in the oceanic mixed layer. Ventilation of species such as dimethyl sulfide and methyl bromide constitutes a key linkage within the earth system. We are creating a research tool for the study of marine trace gas distributions by implementing coupled ecology-gas chemistry in the high resolution Parallel Ocean Program (POP). The fundamental circulation model is eddy resolving, with cell sizes averaging 0.15 degree (lat/long). Here we describe ecochemistry integration. Density dependent mortality and iron geochemistry have enhanced agreement with chlorophyll measurements. Indications are that dimethyl sulfide production rates must be adjusted for latitude dependence to match recent compilations. This may reflect the need for phytoplankton to conserve nitrogen by favoring sulfurous osmolytes. Global simulations are also available for carbonyl sulfide, the methyl halides and for nonmethane hydrocarbons. We discuss future applications including interaction with atmospheric chemistry models, high resolution biogeochemical snapshots and the study of open ocean fertilization.

  14. Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage

    Energy Technology Data Exchange (ETDEWEB)

    Kim, Jungwon [ORNL; Sajjapongse, Kittisak [ORNL; Lee, Seyong [ORNL; Vetter, Jeffrey S [ORNL

    2017-01-01

    A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.

  15. Parallel Numerical Simulations of Water Reservoirs

    Science.gov (United States)

    Torres, Pedro; Mangiavacchi, Norberto

    2010-11-01

    The study of the water flow and scalar transport in water reservoirs is important for the determination of the water quality during the initial stages of the reservoir filling and during the life of the reservoir. For this scope, a parallel 2D finite element code for solving the incompressible Navier-Stokes equations coupled with scalar transport was implemented using the message-passing programming model, in order to perform simulations of hidropower water reservoirs in a computer cluster environment. The spatial discretization is based on the MINI element that satisfies the Babuska-Brezzi (BB) condition, which provides sufficient conditions for a stable mixed formulation. All the distributed data structures needed in the different stages of the code, such as preprocessing, solving and post processing, were implemented using the PETSc library. The resulting linear systems for the velocity and the pressure fields were solved using the projection method, implemented by an approximate block LU factorization. In order to increase the parallel performance in the solution of the linear systems, we employ the static condensation method for solving the intermediate velocity at vertex and centroid nodes separately. We compare performance results of the static condensation method with the approach of solving the complete system. In our tests the static condensation method shows better performance for large problems, at the cost of an increased memory usage. Performance results for other intensive parts of the code in a computer cluster are also presented.

  16. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

    Science.gov (United States)

    Rostrup, Scott; De Sterck, Hans

    2010-12-01

    Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL

  17. Parallelization of 2-D lattice Boltzmann codes

    International Nuclear Information System (INIS)

    Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo.

    1996-03-01

    Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author)

  18. Parallelization of 2-D lattice Boltzmann codes

    Energy Technology Data Exchange (ETDEWEB)

    Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo

    1996-03-01

    Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author).

  19. Parallel processing for fluid dynamics applications

    International Nuclear Information System (INIS)

    Johnson, G.M.

    1989-01-01

    The impact of parallel processing on computational science and, in particular, on computational fluid dynamics is growing rapidly. In this paper, particular emphasis is given to developments which have occurred within the past two years. Parallel processing is defined and the reasons for its importance in high-performance computing are reviewed. Parallel computer architectures are classified according to the number and power of their processing units, their memory, and the nature of their connection scheme. Architectures which show promise for fluid dynamics applications are emphasized. Fluid dynamics problems are examined for parallelism inherent at the physical level. CFD algorithms and their mappings onto parallel architectures are discussed. Several example are presented to document the performance of fluid dynamics applications on present-generation parallel processing devices

  20. Design considerations for parallel graphics libraries

    Science.gov (United States)

    Crockett, Thomas W.

    1994-01-01

    Applications which run on parallel supercomputers are often characterized by massive datasets. Converting these vast collections of numbers to visual form has proven to be a powerful aid to comprehension. For a variety of reasons, it may be desirable to provide this visual feedback at runtime. One way to accomplish this is to exploit the available parallelism to perform graphics operations in place. In order to do this, we need appropriate parallel rendering algorithms and library interfaces. This paper provides a tutorial introduction to some of the issues which arise in designing parallel graphics libraries and their underlying rendering algorithms. The focus is on polygon rendering for distributed memory message-passing systems. We illustrate our discussion with examples from PGL, a parallel graphics library which has been developed on the Intel family of parallel systems.