WorldWideScience

Sample records for memory parallel programming

  1. Synthetic models of distributed memory parallel programs

    Energy Technology Data Exchange (ETDEWEB)

    Poplawski, D.A. (Michigan Technological Univ., Houghton, MI (USA). Dept. of Computer Science)

    1990-09-01

    This paper deals with the construction and use of simple synthetic programs that model the behavior of more complex, real parallel programs. Synthetic programs can be used in many ways: to construct an easily ported suite of benchmark programs, to experiment with alternate parallel implementations of a program without actually writing them, and to predict the behavior and performance of an algorithm on a new or hypothetical machine. Synthetic programs are easily constructed from scratch or from existing programs, and can even be constructed using nothing but information obtained from traces of the real program's execution.
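
    The record includes no source code; purely as a hedged illustration, a synthetic program in this spirit can be as small as a loop alternating a tunable compute phase with a fixed communication pattern, where the constants (steps, work per step, message size, partner) would be taken from a trace of the real program. Everything below is our sketch; none of the names or values come from the paper.

        /* Hypothetical synthetic-program skeleton (our sketch, not the
         * paper's): a compute phase plus a communication phase whose
         * parameters would come from a trace of the real program. */
        #include <mpi.h>
        #include <stdlib.h>

        #define STEPS 100       /* placeholder trace-derived values */
        #define WORK  1000000L
        #define MSG   4096

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            double *buf = calloc(MSG, sizeof *buf);
            volatile double x = 0.0;
            for (int s = 0; s < STEPS; s++) {
                for (long i = 0; i < WORK; i++)      /* synthetic compute */
                    x += 1e-9;
                int partner = rank ^ 1;              /* synthetic pattern */
                if (partner < size)
                    MPI_Sendrecv_replace(buf, MSG, MPI_DOUBLE, partner, 0,
                                         partner, 0, MPI_COMM_WORLD,
                                         MPI_STATUS_IGNORE);
            }
            free(buf);
            MPI_Finalize();
            return 0;
        }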

  2. Optimizing FORTRAN Programs for Hierarchical Memory Parallel Processing Systems

    Institute of Scientific and Technical Information of China (English)

    金国华; 陈福接

    1993-01-01

    Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in memory hierarchies, the "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases with one linear function in the perfectly nested loop. In this paper, we present a parallel program optimizing technique called hybrid loop interchange (HLI) for the cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program parallelism.
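
    HLI itself is not reproduced in this record; the minimal C sketch below shows only the basic loop interchange idea that HLI generalizes: reordering a nested loop so the inner loop walks memory contiguously, which is what keeps cache lines from ping-ponging between processors.

        /* Loop interchange for locality (illustration of the idea HLI
         * builds on; HLI itself also handles loop-carried dependences). */
        #define N 1024
        double a[N][N], b[N][N];

        void before(void) {          /* poor: inner loop strides N doubles */
            for (int j = 0; j < N; j++)
                for (int i = 0; i < N; i++)
                    a[i][j] = 2.0 * b[i][j];
        }

        void after(void) {           /* good: inner loop is stride-1 */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    a[i][j] = 2.0 * b[i][j];
        }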

  3. Optimized Parallel Execution of Declarative Programs on Distributed Memory Multiprocessors

    Institute of Scientific and Technical Information of China (English)

    沈美明; 田新民; et al.

    1993-01-01

    In this paper, we focus on the compiling implementation of the parallel logic language PARLOG and the functional language ML on distributed memory multiprocessors. Under the graph rewriting framework, a Heterogeneous Parallel Graph Rewriting Execution Model (HPGREM) is presented first. Then, based on HPGREM, a parallel abstract machine PAM/TGR is described. Furthermore, several optimizing compilation schemes for executing declarative programs on a transputer array are proposed. The performance statistics on a transputer array demonstrate the effectiveness of our model, parallel abstract machine, optimizing compilation strategies and compiler.

  4. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    Science.gov (United States)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.

  5. A comparison of distributed memory and virtual shared memory parallel programming models

    Energy Technology Data Exchange (ETDEWEB)

    Keane, J.A. [Univ. of Manchester (United Kingdom). Dept. of Computer Science; Grant, A.J. [Univ. of Manchester (United Kingdom). Computer Graphics Unit; Xu, M.Q. [Argonne National Lab., IL (United States)

    1993-04-01

    The virtues of the different parallel programming models, shared memory and distributed memory, have been much debated. Conventionally the debate could be reduced to programming convenience on the one hand, and high scalability on the other. More recently the debate has become somewhat blurred with the provision of virtual shared memory models built on machines with physically distributed memory. The intention of such models/machines is to provide scalable shared memory, i.e. to provide both programmer convenience and high scalability. In this paper, the different models are considered in the light of experience gained with a number of systems ranging from applications in both commerce and science to languages and operating systems. Case studies are introduced as appropriate.

  6. MulticoreBSP for C: A high-performance library for shared-memory parallel programming

    NARCIS (Netherlands)

    Yzelman, A. N.; Bisseling, R. H.; Roose, D.; Meerbergen, K.

    2014-01-01

    The bulk synchronous parallel (BSP) model, as well as parallel programming interfaces based on BSP, classically target distributed-memory parallel architectures. In earlier work, Yzelman and Bisseling designed a MulticoreBSP for Java library specifically for shared-memory architectures. In the prese ...

  7. Deterministic Consistency: A Programming Model for Shared Memory Parallelism

    OpenAIRE

    Aviram, Amittai; Ford, Bryan

    2009-01-01

    The difficulty of developing reliable parallel software is generating interest in deterministic environments, where a given program and input can yield only one possible result. Languages or type systems can enforce determinism in new code, and runtime systems can impose synthetic schedules on legacy parallel code. To parallelize existing serial code, however, we would like a programming model that is naturally deterministic without language restrictions or artificial scheduling. We propose "...

  8. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    Science.gov (United States)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA-based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16 CPU Cray C90 systems. This high level of performance is achieved via shared memory based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine-grained and coarse-grained levels, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
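
    The NAS MLP library itself is not shown in this record. A minimal sketch of the general style it describes (coarse-grained worker processes communicating through a shared arena rather than through messages) can be written with POSIX fork and mmap; all names and sizes below are our assumptions, not NAS code.

        /* Our sketch of MLP-style coarse-grained parallelism: forked
         * processes sharing an anonymous mmap arena (not NAS code). */
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #define NPROC 4
        #define N     (1L << 20)

        int main(void) {
            double *arena = mmap(NULL, N * sizeof(double),
                                 PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            for (int p = 0; p < NPROC; p++) {
                if (fork() == 0) {               /* coarse-grained worker */
                    for (long i = p; i < N; i += NPROC)
                        arena[i] = (double)i;    /* fine-grained loop work */
                    _exit(0);
                }
            }
            for (int p = 0; p < NPROC; p++)
                wait(NULL);
            printf("arena[42] = %g\n", arena[42]);
            return 0;
        }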

  9. Introduction to parallel programming

    CERN Document Server

    Brawer, Steven

    1989-01-01

    Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race ...

  10. Parallel Implementation of a Semidefinite Programming Solver based on CSDP in a distributed memory cluster

    NARCIS (Netherlands)

    Ivanov, I.D.; de Klerk, E.

    2007-01-01

    In this paper we present the algorithmic framework and practical aspects of implementing a parallel version of a primal-dual semidefinite programming solver on a distributed memory computer cluster. Our implementation is based on the CSDP solver and uses a message passing interface (MPI), and the Sc ...

  11. Remote Memory Access: A Case for Portable, Efficient and Library Independent Parallel Programming

    Directory of Open Access Journals (Sweden)

    Alexandros V. Gerbessiotis

    2004-01-01

    Full Text Available In this work we make a strong case for remote memory access (RMA) as the effective way to program a parallel computer by proposing a framework that supports RMA in a library independent, simple and intuitive way. If one uses our approach, the parallel code one writes will run transparently not only under MPI-2 enabled libraries but also under bulk-synchronous parallel libraries. The advantage of using RMA is code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and examine their performance on a LINUX-based PC cluster under three different RMA enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the message-passing equivalent code and in the case of radix-sort substantially more efficient. In addition, our work can be used as a comparative study of the relevant capabilities of the three libraries.
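
    The paper's own framework API is not reproduced in this record; the sketch below shows plain MPI-2 one-sided communication, one of the targets such a framework can map to. MPI_Win_create, MPI_Win_fence and MPI_Get are standard MPI-2 calls; the function name is ours.

        /* Plain MPI-2 one-sided access: every process exposes one double,
         * and each reads rank 0's value without rank 0 posting a receive. */
        #include <mpi.h>

        double read_rank0(double local) {
            double remote;
            MPI_Win win;
            MPI_Win_create(&local, sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
            MPI_Win_fence(0, win);
            MPI_Get(&remote, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
            MPI_Win_fence(0, win);
            MPI_Win_free(&win);
            return remote;
        }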

  12. Parallel External Memory Graph Algorithms

    DEFF Research Database (Denmark)

    Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari

    2010-01-01

    In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one of the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest ... an optimal speedup of Θ(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts.
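
    The PEM algorithms themselves also control how the list is laid out across private caches; for intuition only, here is the textbook pointer-jumping kernel underlying list ranking, written with a shared-memory OpenMP loop rather than the paper's PEM machinery.

        /* Pointer-jumping list ranking: rank[i] = links from i to the
         * tail (the tail points to itself).  O(log n) parallel rounds. */
        #include <stdlib.h>
        #include <string.h>

        void list_rank(int n, const int *next, int *rank) {
            int *nx = malloc(n * sizeof *nx), *nn = malloc(n * sizeof *nn);
            int *nr = malloc(n * sizeof *nr);
            memcpy(nx, next, n * sizeof *nx);
            for (int i = 0; i < n; i++)
                rank[i] = (nx[i] == i) ? 0 : 1;
            for (int step = 1; step < n; step <<= 1) {
                #pragma omp parallel for        /* each round is parallel */
                for (int i = 0; i < n; i++) {
                    nr[i] = rank[i] + rank[nx[i]];
                    nn[i] = nx[nx[i]];          /* jump over a successor */
                }
                memcpy(rank, nr, n * sizeof *nr);
                memcpy(nx, nn, n * sizeof *nn);
            }
            free(nx); free(nn); free(nr);
        }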

  13. 3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

    Energy Technology Data Exchange (ETDEWEB)

    Sofronov, I.D.; Voronin, B.L.; Butnev, O.I. [VNIIEF (Russian Federation)] [and others]

    1997-12-31

    The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS), satisfying the condition that the numerical results be independent of the number of processors involved. Two basically different approaches to the structure of massively parallel computations have been developed. The first approach uses a 3D data matrix decomposition reconstructed at each temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle. The program was developed on the 8-processor CS MP-3 made at VNIIEF and was adapted to the massively parallel CS Meiko-2 at LLNL by the joint efforts of the VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different numbers of processors, up to 256, and the efficiency of parallelization has been evaluated as a function of the number of processors and their parameters.
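
    Neither decomposition is given in code form in this record. As an illustrative sketch only, the communication step of the simplest such decomposition (cutting the 3D mesh into slabs along one axis) is a ghost-plane exchange; all names below are ours.

        /* Hypothetical ghost-plane exchange for a 3D mesh cut into
         * z-slabs.  u holds nz interior planes plus one ghost plane at
         * each end (plane 0 and plane nz+1). */
        #include <mpi.h>

        void exchange_ghosts(double *u, int nx, int ny, int nz,
                             int rank, int size, MPI_Comm comm) {
            int plane = nx * ny;            /* doubles per z-plane */
            int up   = (rank + 1 < size) ? rank + 1 : MPI_PROC_NULL;
            int down = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
            MPI_Sendrecv(u + nz * plane, plane, MPI_DOUBLE, up, 0,
                         u, plane, MPI_DOUBLE, down, 0,
                         comm, MPI_STATUS_IGNORE);
            MPI_Sendrecv(u + plane, plane, MPI_DOUBLE, down, 1,
                         u + (nz + 1) * plane, plane, MPI_DOUBLE, up, 1,
                         comm, MPI_STATUS_IGNORE);
        }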

  14. Parallel Execution of Prolog on Shared-Memory Multiprocessors

    Institute of Scientific and Technical Information of China (English)

    高耀清; 王鼎兴; et al.

    1993-01-01

    Logic programs offer many opportunities for the exploitation of parallelism.But the parallel execution of a task incurs various overheads.This paper focuses on the issues relevant to parallelizing Prolog on shared-memory multiprocessors efficiently.

  15. Developing Parallel Programs

    Directory of Open Access Journals (Sweden)

    Ranjan Sen

    2012-09-01

    Full Text Available Parallel programming is an extension of sequential programming; today, it is becoming the mainstream paradigm in day-to-day information processing. Its aim is to build the fastest programs on parallel computers. The methodologies for developing a parallel program can be put into integrated frameworks. Development focuses on algorithms, languages, and how the program is deployed on the parallel computer.

  16. Parallelization of While Loops in Nested Loop Programs for Shared-Memory Multiprocessor Systems

    NARCIS (Netherlands)

    Geuns, Stefan J.; Bekooij, Marco J.G.; Bijlsma, Tjerk; Corporaal, Henk

    2011-01-01

    Many applications contain loops with an undetermined number of iterations. These loops have to be parallelized in order to increase the throughput when executed on an embedded multiprocessor platform. This paper presents a method to automatically extract a parallel task graph based on function level ...

  17. The ParaScope parallel programming environment

    Science.gov (United States)

    Cooper, Keith D.; Hall, Mary W.; Hood, Robert T.; Kennedy, Ken; Mckinley, Kathryn S.; Mellor-Crummey, John M.; Torczon, Linda; Warren, Scott K.

    1993-01-01

    The ParaScope parallel programming environment, developed to support scientific programming of shared-memory multiprocessors, includes a collection of tools that use global program analysis to help users develop and debug parallel programs. This paper focuses on ParaScope's compilation system, its parallel program editor, and its parallel debugging system. The compilation system extends the traditional single-procedure compiler by providing a mechanism for managing the compilation of complete programs. Thus, ParaScope can support both traditional single-procedure optimization and optimization across procedure boundaries. The ParaScope editor brings both compiler analysis and user expertise to bear on program parallelization. It assists the knowledgeable user by displaying and managing analysis and by providing a variety of interactive program transformations that are effective in exposing parallelism. The debugging system detects and reports timing-dependent errors, called data races, in execution of parallel programs. The system combines static analysis, program instrumentation, and run-time reporting to provide a mechanical system for isolating errors in parallel program executions. Finally, we describe a new project to extend ParaScope to support programming in FORTRAN D, a machine-independent parallel programming language intended for use with both distributed-memory and shared-memory parallel computers.

  18. Parallel programming with PCN

    Energy Technology Data Exchange (ETDEWEB)

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (c.f. Appendix A).

  19. PDDP, A Data Parallel Programming Model

    Directory of Open Access Journals (Sweden)

    Karen H. Warren

    1996-01-01

    Full Text Available PDDP, the parallel data distribution preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

  1. On the Problem of Optimizing Parallel Programs for Complex Memory Hierarchies

    Institute of Scientific and Technical Information of China (English)

    金国华; 陈福接

    1994-01-01

    Based on a thorough study of the relationship between array element accesses and loop indices of the nested loop, a method is presented with which the staggering relation and the compacting relation between the threads of the nested loop (either with a single linear function or with multiple linear functions) can be determined at compile-time, and accordingly the nested loop (either a perfectly nested one or an imperfectly nested one) can be restructured to avoid the thrashing problem. Due to its simplicity, our method can be efficiently implemented in any parallel compiler, and the improvement of the performance is significant as shown by the experimental results.

  2. Parallel Programming Paradigms

    Science.gov (United States)

    1987-07-01

    Parallel Programming Paradigms. Philip Arne Nelson, Department of Computer ... 8416878 and by the Office of Naval Research Contracts No. N00014-86-K-0264 and No. N00014-85-K-0328.

  3. Parallel programming with PCN

    Energy Technology Data Exchange (ETDEWEB)

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  4. Parallelization of the molecular dynamics code GROMOS87 for distributed memory parallel architectures

    NARCIS (Netherlands)

    Green, DG; Meacham, KE; vanHoesel, F; Hertzberger, B; Serazzi, G

    1995-01-01

    This paper describes the techniques and methodologies employed during parallelization of the Molecular Dynamics (MD) code GROMOS87, with the specific requirement that the program run efficiently on a range of distributed-memory parallel platforms. We discuss the preliminary results of our parallel ...

  5. Parallel Programming with Intel Parallel Studio XE

    CERN Document Server

    Blair-Chappell , Stephen

    2012-01-01

    Optimize code for multi-core processors with Intel's Parallel Studio. Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the ...

  6. Customizable Memory Schemes for Data Parallel Architectures

    NARCIS (Netherlands)

    Gou, C.

    2011-01-01

    Memory system efficiency is crucial for any processor to achieve high performance, especially in the case of data parallel machines. Processing capabilities of parallel lanes will be wasted when data requests are not accomplished in a sustainable and timely manner. Irregular vector memory accesses ...

  7. Parallel programming with MPI

    Energy Technology Data Exchange (ETDEWEB)

    Tatebe, Osamu [Electrotechnical Lab., Tsukuba, Ibaraki (Japan)

    1998-03-01

    MPI is a practical, portable, efficient and flexible standard for message passing, which has been implemented on most MPPs and networks of workstations by machine vendors, universities and national laboratories. MPI avoids specifying how operations will take place and superfluous work to achieve efficiency as well as portability, and is also designed to encourage overlapping communication and computation to hide communication latencies. This presentation briefly explains the MPI standard, and comments on efficient parallel programming to improve performance. (author)
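
    A minimal sketch of the overlap idiom the abstract refers to, using only standard MPI calls: start a nonblocking exchange, compute on data that does not depend on it, then wait. The function and argument names are ours.

        /* Overlap idiom: post nonblocking transfers, do independent
         * work, then wait.  Names are ours; the calls are standard MPI. */
        #include <mpi.h>

        void overlapped_step(double *recv_buf, double *send_buf, int n,
                             int left, int right, double *interior,
                             long m, MPI_Comm comm) {
            MPI_Request req[2];
            MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, comm, &req[0]);
            MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &req[1]);
            for (long i = 0; i < m; i++)   /* work that hides the latency */
                interior[i] *= 0.5;
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }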

  8. Architectural Adaptability in Parallel Programming

    Science.gov (United States)

    1991-05-01

    Architectural Adaptability in Parallel Programming. Lawrence Alan Crowl, Technical Report 381, May 1991 (AD-A247 516). University of Rochester, Computer Science. Submitted in partial fulfillment of the ... in the development of their programs. In applying abstraction to parallel programming, we can use abstractions to represent potential parallelism ...

  9. Computer Assisted Parallel Program Generation

    CERN Document Server

    Kawata, Shigeo

    2015-01-01

    Parallel computation is widely employed in scientific research, engineering activities and product development. Parallel program writing itself is not always a simple task, depending on the problems solved. Large-scale scientific computing, huge data analyses and precise visualizations, for example, would require parallel computations, and parallel computing needs parallelization techniques. In this chapter a parallel program generation support is discussed, and a computer-assisted parallel program generation system P-NCAS is introduced. Computer assisted problem solving is one of the key methods to promote innovations in science and engineering, and contributes to enriching our society and our life toward a programming-free environment in computing science. Problem solving environment (PSE) research activities started to enhance the programming power in the 1970s. P-NCAS is one of the PSEs; the PSE concept provides an integrated human-friendly computational software and hardware system to solve a target ...

  10. Programming massively parallel processors a hands-on approach

    CERN Document Server

    Kirk, David B

    2010-01-01

    Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVIDIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...

  11. Memory-based parallel data output controller

    Science.gov (United States)

    Stattel, R. J.; Niswander, J. K. (Inventor)

    1984-01-01

    A memory-based parallel data output controller employs associative memories and memory mapping to decommutate multiple channels of telemetry data. The output controller contains a random access memory (RAM) which has at least as many address locations as there are channels. A word counter addresses the RAM, which provides as its outputs an encoded peripheral device number and an MSB/LSB-first flag. The encoded device number and a bit counter address a second RAM which contains START and STOP flags to pick out the required bits from the specified word number. The LSB/MSB, START and STOP flags, along with the serial input digital data, go to a control block which selectively fills a shift register used to drive the parallel data output bus.
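
    The patent describes hardware, not software. Purely as an illustrative model of the two-RAM lookup chain it describes, the C sketch below routes one telemetry word; the table layouts and sizes are hypothetical.

        /* Software model of the two-RAM lookup chain (the device itself
         * is hardware; all table layouts here are hypothetical). */
        #include <stdint.h>

        #define NCHAN 64

        struct word_map { uint8_t device; uint8_t msb_first; }; /* RAM 1 */
        struct bit_map  { uint8_t start;  uint8_t stop;      }; /* RAM 2 */

        struct word_map ram1[NCHAN];   /* indexed by the word counter  */
        struct bit_map  ram2[256];     /* indexed by the device number */

        uint32_t route(uint32_t word, int word_count) {
            struct word_map wm = ram1[word_count % NCHAN];
            struct bit_map  bm = ram2[wm.device];
            uint32_t field = 0;
            for (int b = bm.start; b <= bm.stop; b++)  /* pick out bits */
                field = (field << 1) |
                        ((word >> (wm.msb_first ? 31 - b : b)) & 1u);
            return field;      /* would be shifted onto the output bus */
        }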

  12. Approach of generating parallel programs from parallelized algorithm design strategies

    Institute of Scientific and Technical Information of China (English)

    WAN Jian-yi; LI Xiao-ying

    2008-01-01

    Today, parallel programming is dominated by message passing libraries, such as the message passing interface (MPI). This article intends to simplify parallel programming by generating parallel programs from parallelized algorithm design strategies. It uses skeletons to abstract parallelized algorithm design strategies, as well as parallel architectures. Starting from a problem specification, an abstract programming language+ (Apla+) program is generated from parallelized algorithm design strategies and problem-specific function definitions. By combining with parallel architectures, the implicit parallelism inside the parallelized algorithm design strategies is exploited. With implementation and transformation, a C++ with parallel virtual machine (CPPVM) parallel program is finally generated. The parallelized branch and bound (B&B) algorithm design strategy and the parallelized divide and conquer (D&C) algorithm design strategy are studied in this article as examples. It also illustrates the approach with a case study.

  13. A parallel memory architecture for video coding

    Institute of Scientific and Technical Information of China (English)

    Jian-ying PENG; Xiao-lang YAN; De-xian LI; Li-zhong CHEN

    2008-01-01

    To efficiently exploit the performance of single instruction multiple data (SIMD) architectures for video coding, a parallel memory architecture with power-of-two memory modules is proposed. It employs two novel skewing schemes to provide conflict-free access to adjacent elements (8-bit and 16-bit data types) or with power-of-two intervals in both horizontal and vertical directions, which were not possible in previous parallel memory architectures. Area consumptions and delay estimations are given with 4, 8 and 16 memory modules, respectively. Under a 0.18-μm CMOS technology, the synthesis results show that the proposed system can achieve 230 MHz clock frequency with 16 memory modules at the cost of 19k gates when read and write latencies are 3 and 2 clock cycles, respectively. We implement the proposed parallel memory architecture on a video signal processor (VSP). The results show that the VSP enhanced with the proposed architecture achieves a 1.28x speedup for H.264 real-time decoding.
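
    The paper's two skewing schemes are not given in this record. For background only, the classic diagonal skew below shows how a skewing scheme makes both row and column accesses conflict-free; the paper's schemes go further and also handle the power-of-two strides that this simple skew does not.

        /* Classic diagonal skew: element (x, y) lives in module
         * (x + y) mod M, so a whole row or a whole column touches all
         * M modules exactly once -- conflict-free in both directions. */
        #define M 8                        /* number of memory modules */

        static inline int module_of(int x, int y) { return (x + y) % M; }
        static inline int offset_of(int x, int y, int width) {
            return (y * width + x) / M;    /* address inside a module  */
        }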

  14. The PISCES 2 parallel programming environment

    Science.gov (United States)

    Pratt, Terrence W.

    1987-01-01

    PISCES 2 is a programming environment for scientific and engineering computations on MIMD parallel computers. It is currently implemented on a flexible FLEX/32 at NASA Langley, a 20 processor machine with both shared and local memories. The environment provides an extended Fortran for applications programming, a configuration environment for setting up a run on the parallel machine, and a run-time environment for monitoring and controlling program execution. This paper describes the overall design of the system and its implementation on the FLEX/32. Emphasis is placed on several novel aspects of the design: the use of a carefully defined virtual machine, programmer control of the mapping of virtual machine to actual hardware, forces for medium-granularity parallelism, and windows for parallel distribution of data. Some preliminary measurements of storage use are included.

  15. XJava: Exploiting Parallelism with Object-Oriented Stream Programming

    Science.gov (United States)

    Otto, Frank; Pankratius, Victor; Tichy, Walter F.

    This paper presents the XJava compiler for parallel programs. It exploits parallelism based on an object-oriented stream programming paradigm. XJava extends Java with new parallel constructs that do not expose programmers to low-level details of parallel programming on shared memory machines. Tasks define composable parallel activities, and new operators allow an easier expression of parallel patterns, such as pipelines, divide and conquer, or master/worker. We also present an automatic run-time mechanism that extends our previous work to automatically map tasks and parallel statements to threads.

  16. Patterns For Parallel Programming

    CERN Document Server

    Mattson, Timothy G; Massingill, Berna L

    2005-01-01

    From grids and clusters to next-generation game consoles, parallel computing is going mainstream. Innovations such as Hyper-Threading Technology, HyperTransport Technology, and multicore microprocessors from IBM, Intel, and Sun are accelerating the movement's growth. Only one thing is missing: programmers with the skills to meet the soaring demand for parallel software.

  17. PDDP: A data parallel programming model. Revision 1

    Energy Technology Data Exchange (ETDEWEB)

    Warren, K.H.

    1995-06-01

    PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives and parallelism expressed by the use of Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared-memory style and generates codes that are portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives on each platform.

  18. Parallel programming with Python

    CERN Document Server

    Palach, Jan

    2014-01-01

    A fast, easy-to-follow and clear tutorial to help you develop parallel computing systems using Python. Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts and will help you in implementing these techniques in the real world. If you are an experienced Python programmer and are willing to utilize the available computing resources by parallelizing applications in a simple way, then this book is for you. You are required to have a basic knowledge of Python development to get the most out of this book.

  19. Practical parallel programming

    CERN Document Server

    Bauer, Barr E

    2014-01-01

    This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.

  1. Multilanguage parallel programming of heterogeneous machines

    Energy Technology Data Exchange (ETDEWEB)

    Bisiani, R.; Forin, A.

    1988-08-01

    The authors designed and implemented a system, Agora, that supports the development of multilanguage parallel applications for heterogeneous machines. Agora hinges on two ideas: the first one is that shared memory can be a suitable abstraction to program concurrent, multilanguage modules running on heterogeneous machines. The second one is that a shared memory abstraction can be efficiently supported across different computer architectures that are not connected by a physical shared memory, for example local area network workstations or ensemble machines. Agora has been in use for more than a year. This paper describes the Agora shared memory and its software implementation on both tightly and loosely coupled architectures. Measurements of the current implementation are also included.

  2. Data-Parallel Programming in a Multithreaded Environment

    Directory of Open Access Journals (Sweden)

    Matthew Haines

    1997-01-01

    Full Text Available Research on programming distributed memory multiprocessors has resulted in a well-understood programming model, namely data-parallel programming. However, data-parallel programming in a multithreaded environment is far less understood. For example, if multiple threads within the same process belong to different data-parallel computations, then the architecture, compiler, or run-time system must ensure that relative indexing and collective operations are handled properly and efficiently. We introduce a run-time-based solution for data-parallel programming in a distributed memory environment that handles the problems of relative indexing and collective communications among thread groups. As a result, the data-parallel programming model can now be executed in a multithreaded environment, such as a system using threads to support both task and data parallelism.

  3. Plasmonics and the parallel programming problem

    Science.gov (United States)

    Vishkin, Uzi; Smolyaninov, Igor; Davis, Chris

    2007-02-01

    While many parallel computers have been built, it has generally been too difficult to program them. Now, all computers are effectively becoming parallel machines. Biannual doubling in the number of cores on a single chip, or faster, over the coming decade is planned by most computer vendors. Thus, the parallel programming problem is becoming more critical. The only known solution to the parallel programming problem in the theory of computer science is through a parallel algorithmic theory called PRAM. Unfortunately, some of the PRAM theory assumptions regarding the bandwidth between processors and memories did not properly reflect a parallel computer that could be built in previous decades. Reaching memories, or other processors in a multi-processor organization, required off-chip connections through pins on the boundary of each electric chip. Using the number of transistors that is becoming available on chip, on-chip architectures that adequately support the PRAM are becoming possible. However, the bandwidth of off-chip connections remains insufficient and the latency remains too high. This creates a bottleneck at the boundary of the chip for a PRAM-On-Chip architecture. This also prevents scalability to larger "supercomputing" organizations spanning across many processing chips that can handle massive amounts of data. Instead of connections through pins and wires, power-efficient CMOS-compatible on-chip conversion to plasmonic nanowaveguides is introduced for improved latency and bandwidth. Proper incorporation of our ideas offers exciting avenues to resolving the parallel programming problem, and an alternative way for building faster, more usable and much more compact supercomputers.

  4. A Tutorial on Parallel and Concurrent Programming in Haskell

    Science.gov (United States)

    Peyton Jones, Simon; Singh, Satnam

    This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.

  5. High-Level Parallel Programming.

    Science.gov (United States)

    ... parallel programming languages. These issues were evaluated via the utilization of a language called UC. UC is a programming language aimed at balancing notational simplicity with execution efficiency and portability. UC accomplishes this by separating the programming task from the efficiency issues. This report gives a description of the language, its current implementation, its verification methodology and its use in designing various ...

  6. The FORCE: A highly portable parallel programming language

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.

  7. The FORCE - A highly portable parallel programming language

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

    1989-01-01

    This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
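
    The FORCE macros themselves are not reproduced in these records. The C fragment below sketches only the flavor of the two-level macro idea: a machine-dependent layer (here POSIX threads, our choice) hidden beneath machine-independent constructs; the macro names are ours, not FORCE's.

        /* Two-level macro sketch in the FORCE spirit (our names): a
         * machine-dependent layer below, portable constructs on top. */
        #include <pthread.h>

        /* low level: one implementation per machine (here, pthreads) */
        static pthread_barrier_t force_barrier;
        #define SYNC_INIT(np)  pthread_barrier_init(&force_barrier, NULL, (np))
        #define SYNC_WAIT()    pthread_barrier_wait(&force_barrier)

        /* high level: what portable programs are written against */
        #define Barrier SYNC_WAIT()
        /* self-scheduled loop over [lo, hi): each process takes a stripe */
        #define Forall(i, lo, hi, me, np) \
            for (int i = (lo) + (me); i < (hi); i += (np))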

  8. Writing parallel programs that work

    CERN Document Server

    CERN. Geneva

    2012-01-01

    Serial algorithms typically run inefficiently on parallel machines. This may sound like an obvious statement, but it is the root cause of why parallel programming is considered to be difficult. The current state of the computer industry is still that almost all programs in existence are serial. This talk will describe the techniques used in the Intel Parallel Studio to provide a developer with the tools necessary to understand the behaviors and limitations of the existing serial programs. Once the limitations are known the developer can refactor the algorithms and reanalyze the resulting programs with the tools in the Intel Parallel Studio to create parallel programs that work. About the speaker Paul Petersen is a Sr. Principal Engineer in the Software and Solutions Group (SSG) at Intel. He received a Ph.D. degree in Computer Science from the University of Illinois in 1993. After UIUC, he was employed at Kuck and Associates, Inc. (KAI) working on auto-parallelizing compiler (KAP), and was involved in th...

  9. A Heterogeneous Parallel Programming Capability

    Science.gov (United States)

    1990-11-30

    ... the various implementations of Express attempted to address only the first of these issues - providing a portable, standard platform for parallel programming on a wide variety of different systems. Each implementation, however, was independent, but allowed programs to execute on a single ...

  10. Parallel implementation of inverse adding-doubling and Monte Carlo multi-layered programs for high performance computing systems with shared and distributed memory

    Science.gov (United States)

    Chugunov, Svyatoslav; Li, Changying

    2015-09-01

    Parallel implementation of two numerical tools popular in optical studies of biological materials - the Inverse Adding-Doubling (IAD) program and the Monte Carlo Multi-Layered (MCML) program - was developed and tested in this study. The implementation was based on the Message Passing Interface (MPI) and standard C-language. Parallel versions of the IAD and MCML programs were compared to their sequential counterparts in validation and performance tests. Additionally, the portability of the programs was tested using a local high performance computing (HPC) cluster, Penguin-On-Demand HPC cluster, and Amazon EC2 cluster. Parallel IAD was tested with up to 150 parallel cores using 1223 input datasets. It demonstrated linear scalability and the speedup was proportional to the number of parallel cores (up to 150x). Parallel MCML was tested with up to 1001 parallel cores using problem sizes of 10⁴-10⁹ photon packets. It demonstrated classical performance curves featuring communication overhead and a performance saturation point. An optimal performance curve was derived for parallel MCML as a function of problem size. Typical speedup achieved for parallel MCML (up to 326x) demonstrated linear increase with problem size. Precision of MCML results was estimated in a series of tests - a problem size of 10⁶ photon packets was found optimal for calculations of total optical response and 10⁸ photon packets for spatially-resolved results. The presented parallel versions of the MCML and IAD programs are portable on multiple computing platforms. The parallel programs could significantly speed up the simulation for scientists and be utilized to their full potential in computing systems that are readily available without additional costs.
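
    The authors' sources are not included in this record; the canonical pattern behind such a parallel Monte Carlo code is sketched below in C with MPI: each rank traces an independent share of photon packets and the tallies are reduced at the end. trace_packet() is a hypothetical stand-in for the photon transport kernel.

        /* Canonical parallel Monte Carlo pattern: independent photon
         * packets per rank, one reduction of the tallies at the end. */
        #include <mpi.h>

        extern double trace_packet(unsigned long seed); /* hypothetical */

        double run(long packets, MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            long mine = packets / size + (rank < packets % size ? 1 : 0);
            double local = 0.0, total = 0.0;
            for (long i = 0; i < mine; i++)    /* independent packets */
                local += trace_packet(1000003UL * rank + i);
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
            return total;              /* meaningful on rank 0 only */
        }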

  11. Parallel Programming with MatlabMPI

    CERN Document Server

    Kepner, J V

    2001-01-01

    MatlabMPI is a Matlab implementation of the Message Passing Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements the basic six functions that are the core of the MPI point-to-point communications standard. The key technical innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting in an extremely compact (~100 lines) and "pure" implementation which runs anywhere Matlab runs. The performance has been tested on both shared and distributed memory parallel computers. MatlabMPI can match the bandwidth of C based MPI at large message sizes. A test image filtering application using MatlabMPI achieved a speedup of ~70 on a parallel computer.

  12. Array distribution in data-parallel programs

    Science.gov (United States)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

    1994-01-01

    We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

  13. Parallel Programming Archetypes in Combinatorics and Optimization

    Science.gov (United States)

    1995-06-12

    A parallel programming archetype is a language-independent program design strategy. We describe two archetypes in combinatorics and optimization ... the systematic design of efficient sequential and parallel programs. The research whose results are presented in this document is part of the ongoing project on Parallel Programming Archetype.

  14. PRISMA/DB: A Parallel Main-Memory Relational DBMS

    NARCIS (Netherlands)

    Apers, Peter M.G.; Flokstra, Jan; van den Berg, Carel A.; Grefen, P.W.P.J.; Wilschut, A.N.; Kersten, Martin L.; van den Berg, C.A.

    1992-01-01

    PRISMA/DB, a full-fledged parallel, main memory relational database management system (DBMS), is described. PRISMA/DB's high performance is obtained by the use of parallelism for query processing and main memory storage of the entire database. A flexible architecture for experimenting with functional ...

  15. Graphics-Based Parallel Programming Tools

    Science.gov (United States)

    1991-09-01

    Final report (AD-A254 406), August 1992. Graphics-Based Parallel Programming Tools. Janice E. Cuny, Principal Investigator, Department of ... suggest parallel (either because we use a parallel graph rewriting mechanism or because we apply our results to parallel programming), we interpret it to ... was to provide support for the explicit representation of graphs for use within a parallel programming environment. In our environment, we view a ...

  16. A Parallel Programming Model With Sequential Semantics

    Science.gov (United States)

    1996-01-01

    Parallel programming is more difficult than sequential programming in part because of the complexity of reasoning, testing, and debugging in the context of concurrency. In the thesis, we present and investigate a parallel programming model that provides direct control of parallelism in a notation ...

  17. Four styles of parallel and net programming

    Institute of Scientific and Technical Information of China (English)

    Zhiwei XU; Yongqiang HE; Wei LIN; Li ZHA

    2009-01-01

    This paper reviews the programming landscape for parallel and network computing systems, focusing on four styles of concurrent programming models, and example languages/libraries. The four styles correspond to four scales of the targeted systems. At the smallest, coprocessor scale, Single Instruction Multiple Thread (SIMT) and Compute Unified Device Architecture (CUDA) are considered. Transactional memory is discussed at the multicore or process scale. The MapReduce style is examined at the datacenter scale. At the Internet scale, Grid Service Markup Language (GSML) is reviewed, which intends to integrate resources distributed across multiple datacenters. The four styles are concerned with and emphasize different issues, which are needed by systems at different scales. This paper discusses issues related to efficiency, ease of use, and expressiveness.

  18. Structured Parallel Programming Patterns for Efficient Computation

    CERN Document Server

    McCool, Michael; Robison, Arch

    2012-01-01

    Programming is now parallel programming. Much as structured programming revolutionized traditional serial programming decades ago, a new kind of structured programming, based on patterns, is relevant to parallel programming today. Parallel computing experts and industry insiders Michael McCool, Arch Robison, and James Reinders describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give detailed concrete examples using multiple programming models. Examples are primarily given using two of the ...

  19. About Parallel Programming: Paradigms, Parallel Execution and Collaborative Systems

    Directory of Open Access Journals (Sweden)

    Loredana MOCEAN

    2009-01-01

    Full Text Available In the last years, efforts were made to delineate a stable and unitary frame in which the problems of logical parallel processing can find solutions, at least at the level of imperative languages. The results obtained so far are not at the level of these efforts. This paper aims to make a small contribution to those efforts. We propose an overview of parallel programming, parallel execution and collaborative systems.

  20. Parallel Programming in the Age of Ubiquitous Parallelism

    Science.gov (United States)

    Pingali, Keshav

    2014-04-01

    Multicore and manycore processors are now ubiquitous, but parallel programming remains as difficult as it was 30-40 years ago. During this time, our community has explored many promising approaches including functional and dataflow languages, logic programming, and automatic parallelization using program analysis and restructuring, but none of these approaches has succeeded except in a few niche application areas. In this talk, I will argue that these problems arise largely from the computation-centric foundations and abstractions that we currently use to think about parallelism. In their place, I will propose a novel data-centric foundation for parallel programming called the operator formulation in which algorithms are described in terms of actions on data. The operator formulation shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation/refinement/partitioning and SAT solvers. Regular algorithms emerge as a special case of irregular ones, and many application-specific optimization techniques can be generalized to a broader context. The operator formulation also leads to a structural analysis of algorithms called TAO-analysis that provides implementation guidelines for exploiting parallelism efficiently. Finally, I will describe a system called Galois based on these ideas for exploiting amorphous data-parallelism on multicores and GPUs

  1. Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

    Science.gov (United States)

    Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

    2014-06-01

    This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10⁶ spatial cells and 1 × 10¹² DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-node nuclear simulation tool.
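
    DOMINO combines Intel TBB and Eigen in C++; as a language-neutral illustration of the same two nested levels (multicore + SIMD), the C sketch below expresses the structure with OpenMP pragmas instead. It shows only the nesting, not DOMINO's sweep.

        /* Two nested parallel levels, multicore + SIMD, in OpenMP
         * (DOMINO itself uses Intel TBB and Eigen in C++). */
        void axpy(int n, double a,
                  const double *restrict x, double *restrict y) {
            #pragma omp parallel for     /* level 1: one chunk per core */
            for (int i = 0; i < n; i += 64) {
                int end = (i + 64 < n) ? i + 64 : n;
                #pragma omp simd         /* level 2: vector lanes */
                for (int j = i; j < end; j++)
                    y[j] += a * x[j];
            }
        }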

  2. Reduced order multiport parallel and multidirectional neural associative memories.

    Science.gov (United States)

    Bhatti, Abdul Aziz

    2009-05-01

    This paper proposes multiport parallel and multidirectional intraconnected associative memories of outer product type with reduced interconnections. Some new reduced order memory architectures such as k-directional and k-port parallel memories are suggested. These architectures are also very suitable for implementation of spatio-temporal sequences and multiassociative memories. It is shown that in the proposed memory architectures, a substantial reduction in interconnections is achieved if the actual length of the original N-bit long vectors is subdivided into k sublengths. Using these sublengths, submemory matrices, T(s) or W(s), are computed, which are then intraconnected to form k-port parallel or k-directional memories. The subdivision of N-bit long vectors into k sublengths saves ((k-1) x 100)/k % of interconnections. It is shown, by means of an example, that more than 80% reduction in interconnections is achieved. The minimum limit in bits on k as well as the maximum limit on subdivisions in k is determined. The topologies of reduced interconnectivity developed in this paper are symmetric in structure and can be used to scale up to larger systems. The underlying principle of construction and the storage and retrieval processes of such associative memories have been analyzed. The effect of complexity of different levels of reduced interconnectivity on the quality of retrieval, signal to noise ratio, and storage capacity has been investigated. The model possesses analogies to biological neural structures and digital parallel port memories commonly used in parallel and multiprocessing systems.
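
    The quoted savings follow from outer-product storage: a full correlation matrix over N-bit vectors needs N x N interconnections, while k submatrices over N/k-bit sublengths need k(N/k)² = N²/k, so the saved fraction is (k-1)/k. The short program below (ours, not the paper's) just tabulates this; k = 8 already exceeds the 80% reduction cited above.

        /* Tabulate the interconnection savings (k-1)/k from the text:
         * N*N weights for the full matrix vs. k*(N/k)^2 = N*N/k. */
        #include <stdio.h>

        int main(void) {
            long N = 1024;
            for (int k = 2; k <= 8; k *= 2) {
                long full = N * N, reduced = N * N / k;
                printf("k=%d: %ld -> %ld weights (%.1f%% saved)\n",
                       k, full, reduced, 100.0 * (k - 1) / k);
            }
            return 0;  /* k = 8 gives 87.5%, matching "more than 80%" */
        }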

  3. Genetic Parallel Programming: design and implementation.

    Science.gov (United States)

    Cheang, Sin Man; Leung, Kwong Sak; Lee, Kin Hong

    2006-01-01

    This paper presents a novel Genetic Parallel Programming (GPP) paradigm for evolving parallel programs running on a Multi-Arithmetic-Logic-Unit (Multi-ALU) Processor (MAP). The MAP is a Multiple Instruction-streams, Multiple Data-streams (MIMD), general-purpose register machine that can be implemented on modern Very Large-Scale Integrated Circuits (VLSIs) in order to evaluate genetic programs at high speed. For human programmers, writing parallel programs is more difficult than writing sequential programs. However, experimental results show that GPP evolves parallel programs with less computational effort than that of their sequential counterparts. It creates a new approach to evolving a feasible problem solution in parallel program form and then serializes it into a sequential program if required. The effectiveness and efficiency of GPP are investigated using a suite of 14 well-studied benchmark problems. Experimental results show that GPP speeds up evolution substantially.

  4. Requirements for Data-Parallel Programming Environments

    Science.gov (United States)

    1994-04-22

    ... fully automatic techniques would be insufficient by themselves to support general parallel programming, even in the limited domain of scientific computation. In other words, in an effective parallel programming system, the programmer would have to provide additional information to help the system ... convey an understanding of the tools and strategies that will be needed to adequately support efficient, machine-independent, data-parallel programming.

  5. Open-MP and Parallel Programming

    Institute of Scientific and Technical Information of China (English)

    陈崚; 陈宏建; 秦玲

    2003-01-01

    This paper illustrates Open-MP, an application programming interface for shared-memory parallel computer systems, and its characteristics, and compares Open-MP with the parallel programming tool MPI. To overcome the large runtime overhead of Open-MP programs, several optimization methods for Open-MP programming are presented to increase execution efficiency.
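
    Since the record contrasts OpenMP's directive style with MPI's explicit message passing, a minimal sketch may help. This is a generic OpenMP reduction in C, not code from the paper; the loop and problem size are illustrative.

        #include <omp.h>
        #include <stdio.h>

        /* A single OpenMP directive parallelizes this reduction; MPI would
         * require explicit decomposition and message passing for the same loop. */
        int main(void) {
            const int n = 1000000;
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int i = 0; i < n; i++)
                sum += 1.0 / (double)(i + 1);   /* illustrative workload */
            printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
            return 0;
        }

    Built with an OpenMP-capable compiler (e.g., cc -fopenmp), the iterations are split across threads and the per-thread partial sums are combined by the reduction clause.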

  6. Characterizing and Mitigating Work Time Inflation in Task Parallel Programs

    Directory of Open Access Journals (Sweden)

    Stephen L. Olivier

    2013-01-01

    Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation – additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
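
    A minimal OpenMP tasking sketch (not one of the paper's benchmarks) shows the programming model under study; tasks this fine-grained also make the scheduling overheads the paper measures easy to observe.

        #include <omp.h>
        #include <stdio.h>

        /* Each recursive call becomes a task that the OpenMP runtime
         * schedules onto idle threads; taskwait joins the two children. */
        static long fib(int n) {
            if (n < 2) return n;
            long a, b;
            #pragma omp task shared(a)
            a = fib(n - 1);
            #pragma omp task shared(b)
            b = fib(n - 2);
            #pragma omp taskwait
            return a + b;
        }

        int main(void) {
            long r;
            #pragma omp parallel
            #pragma omp single          /* one thread seeds the task tree */
            r = fib(30);
            printf("fib(30) = %ld\n", r);
            return 0;
        }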

  7. Parallel Programming Environment for OpenMP

    Directory of Open Access Journals (Sweden)

    Insung Park

    2001-01-01

    We present our effort to provide a comprehensive parallel programming environment for the OpenMP parallel directive language. This environment includes a parallel programming methodology for the OpenMP programming model and a set of tools (Ursa Minor and InterPol) that support this methodology. Our toolset provides automated and interactive assistance to parallel programmers in time-consuming tasks of the proposed methodology. The features provided by our tools include performance and program structure visualization, interactive optimization, support for performance modeling, and performance advising for finding and correcting performance problems. The presented evaluation demonstrates that our environment offers significant support in general parallel tuning efforts and that the toolset facilitates many common tasks in OpenMP parallel programming in an efficient manner.

  8. Parallel programming with PCN. Revision 1

    Energy Technology Data Exchange (ETDEWEB)

    Foster, I.; Tuecke, S.

    1991-12-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous FTP from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A).

  9. Parallel programming characteristics of a DSP-based parallel system

    Institute of Scientific and Technical Information of China (English)

    GAO Shu; GUO Qing-ping

    2006-01-01

    This paper first introduces the structure and working principle of a DSP-based parallel system, its parallel accelerating board, and the SHARC DSP chip. It then investigates the system's programming characteristics, especially its communication modes, discusses how to design parallel algorithms, and presents a domain-decomposition-based complete multigrid parallel algorithm with virtual boundary forecast (VBF) for solving large-scale, complicated heat problems. Finally, the Mandelbrot set and a non-linear heat transfer equation for a ceramic/metal composite material are taken as examples to illustrate the implementation of the proposed algorithm. The results show that the solutions are highly efficient and achieve linear speedup.
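
    The domain-decomposition structure is easier to see in code. Below is a generic 1-D halo (virtual boundary) exchange written with standard MPI, a sketch of the data movement such algorithms need each iteration; it is not the paper's DSP-specific communication layer, and the VBF forecasting step is not shown. LOCAL and the message tags are illustrative.

        #include <mpi.h>

        #define LOCAL 1024   /* cells owned by each rank (illustrative) */

        /* u has LOCAL interior cells plus one ghost cell at each end that
         * mirrors the neighbor's boundary; MPI_PROC_NULL handles the edges. */
        void exchange_halos(double u[LOCAL + 2], int rank, int size) {
            int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
            int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;
            /* send left boundary leftward, fill right ghost from the right */
            MPI_Sendrecv(&u[1],         1, MPI_DOUBLE, left,  0,
                         &u[LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* send right boundary rightward, fill left ghost from the left */
            MPI_Sendrecv(&u[LOCAL],     1, MPI_DOUBLE, right, 1,
                         &u[0],         1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }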

  10. Massively Parallel Finite Element Programming

    KAUST Repository

    Heister, Timo

    2010-01-01

    Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundred cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.

  11. Experiences in Data-Parallel Programming

    Directory of Open Access Journals (Sweden)

    Terry W. Clark

    1997-01-01

    To efficiently parallelize a scientific application with a data-parallel compiler requires certain structural properties in the source program, and conversely, the absence of others. A recent parallelization effort of ours reinforced this observation and motivated this correspondence. Specifically, we have transformed a Fortran 77 version of GROMOS, a popular dusty-deck program for molecular dynamics, into Fortran D, a data-parallel dialect of Fortran. During this transformation we have encountered a number of difficulties that probably are neither limited to this particular application nor do they seem likely to be addressed by improved compiler technology in the near future. Our experience with GROMOS suggests a number of points to keep in mind when developing software that may at some time in its life cycle be parallelized with a data-parallel compiler. This note presents some guidelines for engineering data-parallel applications that are compatible with Fortran D or High Performance Fortran compilers.

  12. Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns

    Directory of Open Access Journals (Sweden)

    Xiaodong Zhang

    1999-01-01

    Shared-memory and data-parallel programming models are two important paradigms for scientific applications. Both models provide high-level program abstractions, and simple and uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlying execution and overhead patterns are significantly different between the two models due to their programming constraints, and due to the different and complex structures of the interconnection networks and systems which support the two models. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared-memory model on the KSR-1 and the data-parallel model on the CM-5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation-intensive in the KSR-1 shared-memory system, and memory-demanding in the CM-5 data-parallel system when the systems and the problems are scaled. The EM program, a highly data-parallel program, performed extremely well, while the linear system solver, a highly control-structured program, suffered significantly in the data-parallel model on the CM-5. Our study provides further evidence that matching the execution patterns of algorithms to parallel architectures achieves better performance.

  13. Productive Parallel Programming: The PCN Approach

    Directory of Open Access Journals (Sweden)

    Ian Foster

    1992-01-01

    We describe the PCN programming system, focusing on those features designed to improve the productivity of scientists and engineers using parallel supercomputers. These features include a simple notation for the concise specification of concurrent algorithms, the ability to incorporate existing Fortran and C code into parallel applications, facilities for reusing parallel program components, a portable toolkit that allows applications to be developed on a workstation or small parallel computer and run unchanged on supercomputers, and integrated debugging and performance analysis tools. We survey representative scientific applications and identify problem classes for which PCN has proved particularly useful.

  14. A survey of parallel programming tools

    Science.gov (United States)

    Cheng, Doreen Y.

    1991-01-01

    This survey examines 39 parallel programming tools. Focus is placed on those tool capabilities needed for parallel scientific programming rather than for general computer science. The tools are classified with the current and future needs of the Numerical Aerodynamic Simulator (NAS) in mind: existing and anticipated NAS supercomputers and workstations; operating systems; programming languages; and applications. They are divided into four categories: suggested acquisitions; tools already brought in; tools worth tracking; and tools eliminated from further consideration at this time.

  15. Memory Benchmarks for SMP-Based High Performance Parallel Computers

    Energy Technology Data Exchange (ETDEWEB)

    Yoo, A B; de Supinski, B; Mueller, F; Mckee, S A

    2001-11-20

    As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominate the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even more complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.
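
    The flavor of such microbenchmarks can be sketched with a single-threaded pointer chase (the suite itself is multi-threaded and far more systematic; the footprint and hop count below are illustrative). A dependent chain of loads defeats out-of-order overlap, so time per hop approximates load-to-use latency at the chosen footprint.

        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N    (1L << 22)        /* 32 MB of size_t on LP64 systems */
        #define HOPS 100000000L

        int main(void) {
            size_t *next = malloc(N * sizeof *next);
            if (!next) return 1;
            for (size_t i = 0; i < N; i++) next[i] = i;
            /* Sattolo's shuffle: builds a single random cycle, so hardware
             * prefetchers cannot predict the access stream (rand() is
             * adequate for a sketch) */
            for (size_t i = N - 1; i > 0; i--) {
                size_t j = (size_t)rand() % i;
                size_t t = next[i]; next[i] = next[j]; next[j] = t;
            }
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            size_t p = 0;
            for (long h = 0; h < HOPS; h++) p = next[p];   /* dependent loads */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            printf("%.2f ns/hop (sink: %zu)\n", ns / HOPS, p);
            return 0;
        }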

  16. Using Bilingual Parallel Corpora in Translation Memory Systems

    Directory of Open Access Journals (Sweden)

    Hossein Keshtkar

    2012-09-01

    Automatic word alignment techniques commonly used in Translation Memory systems tend to work at the single-word level, where there is a one-to-one correspondence between words in subsequences of the two languages. This results in an inability to fully exploit subsentential repetitions such as clauses, phrases, and expressions. In this paper, using the spaces between words, a search method named "space-based reduction search" is introduced. The main goal is to maximize the use of parallel corpus resources. We want to show that this search method can significantly enhance the chance of finding matches for subsequences of input sentences, and is hence applicable in a Sub-Sentential Translation Memory (SSTM) system without running automatic alignment tools. Keywords: Sub-Sentential Translation Memory, Parallel corpus, Alignment

  17. Integrated Task and Data Parallel Programming

    Science.gov (United States)

    Grimshaw, A. S.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers 1995 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated

  18. Parallel programmable nonvolatile memory using ordinary static random access memory cells

    Science.gov (United States)

    Mizutani, Tomoko; Takeuchi, Kiyoshi; Saraya, Takuya; Shinohara, Hirofumi; Kobayashi, Masaharu; Hiramoto, Toshiro

    2017-04-01

    A technique for using an ordinary static random access memory (SRAM) array as a programmable nonvolatile (NV) memory is proposed. Parallel NV writing of the entire array is achieved by simply applying high-voltage stress to the power supply terminal after storing the inverted desired data in the SRAM array. Successful 2 kbit NV writing is demonstrated using a device-matrix-array (DMA) test element group (TEG) fabricated in 0.18 µm technology.

  19. Center for Programming Models for Scalable Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    John Mellor-Crummey

    2008-02-29

    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implementation of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from http://www.hipersoft.rice.edu/caf.

  20. Automatic Parallelization Tool: Classification of Program Code for Parallel Computing

    Directory of Open Access Journals (Sweden)

    Mustafa Basthikodi

    2016-04-01

    Performance growth of single-core processors came to a halt in the past decade, but was re-enabled by the introduction of parallelism in processors. Multicore frameworks, along with graphics processing units, have broadly enhanced parallelism. Several compilers have been updated to address the emerging challenges of synchronization and threading. Appropriate program and algorithm classification can greatly help software engineers identify opportunities for effective parallelization. In the present work we investigate current species for the classification of algorithms; related work on classification is discussed, along with a comparison of the issues that challenge classification. A set of algorithms is chosen whose structure matches these different issues and performs the given tasks. We tested these algorithms using existing automatic species-extraction tools along with the Bones compiler. We added functionality to the existing tool, providing a more detailed characterization. The contributions of our work include support for pointer arithmetic, conditional and incremental statements, user-defined types, constants, and mathematical functions. With this, we can retain significant information that is not captured by the original species of algorithms. We implemented the new approach in the tool, enabling automatic characterization of program code.

  1. Parallel programming with PCN. Revision 2

    Energy Technology Data Exchange (ETDEWEB)

    Foster, I.; Tuecke, S.

    1993-01-01

    PCN is a system for developing and executing parallel programs. It comprises a high-level programming language, tools for developing and debugging programs in this language, and interfaces to Fortran and C that allow the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. This document provides all the information required to develop parallel programs with the PCN programming system. It includes both tutorial and reference material. It also presents the basic concepts that underlie PCN, particularly where these are likely to be unfamiliar to the reader, and provides pointers to other documentation on the PCN language, programming techniques, and tools. PCN is in the public domain. The latest version of both the software and this manual can be obtained by anonymous ftp from Argonne National Laboratory in the directory pub/pcn at info.mcs.anl.gov (cf. Appendix A). This version of this document describes PCN version 2.0, a major revision of the PCN programming system. It supersedes earlier versions of this report.

  2. Adaptive Dynamic Process Scheduling on Distributed Memory Parallel Computers

    Directory of Open Access Journals (Sweden)

    Wei Shu

    1994-01-01

    One of the challenges in programming distributed memory parallel machines is deciding how to allocate work to processors. This problem is particularly important for computations with unpredictable dynamic behaviors or irregular structures. We present a scheme for dynamic scheduling of medium-grained processes that is useful in this context. The adaptive contracting within neighborhood (ACWN) is a dynamic, distributed, load-dependent, and scalable scheme. It deals with dynamic and unpredictable creation of processes and adapts to different systems. The scheme is described and contrasted with two other schemes that have been proposed in this context, namely the randomized allocation and the gradient model. The performance of the three schemes on an Intel iPSC/2 hypercube is presented and analyzed. The experimental results show that even though the ACWN algorithm incurs somewhat larger overhead than the randomized allocation, it achieves better performance in most cases due to its adaptiveness. Its feature of quickly spreading the work helps it outperform the gradient model in performance and scalability.

  3. Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer

    Science.gov (United States)

    Blocksome, Michael A.; Mamidala, Amith R.

    2013-09-03

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to segments of shared random access memory through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and a segment of shared memory; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
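
    The fence semantics described here have a rough analog in standard MPI's one-sided model, sketched below; this is ordinary MPI, not IBM's PAMI API, and the window contents are illustrative. Puts issued inside an epoch are only guaranteed complete after the closing fence.

        #include <mpi.h>
        #include <stdio.h>

        /* Run with at least two ranks, e.g. mpirun -np 2 ./a.out */
        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            if (size < 2) { MPI_Finalize(); return 1; }

            double buf = 0.0;
            MPI_Win win;
            MPI_Win_create(&buf, sizeof buf, sizeof buf, MPI_INFO_NULL,
                           MPI_COMM_WORLD, &win);

            MPI_Win_fence(0, win);            /* open the access epoch */
            if (rank == 0) {
                double v = 42.0;              /* illustrative payload */
                MPI_Put(&v, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            }
            MPI_Win_fence(0, win);            /* fence: the put is complete */

            if (rank == 1) printf("received %.1f\n", buf);
            MPI_Win_free(&win);
            MPI_Finalize();
            return 0;
        }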

  4. A Parallel Saturation Algorithm on Shared Memory Architectures

    Science.gov (United States)

    Ezekiel, Jonathan; Siminiceanu

    2007-01-01

    Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual-core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.

  5. Testing New Programming Paradigms with NAS Parallel Benchmarks

    Science.gov (United States)

    Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.

    2000-01-01

    Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with the increasing complexity of real applications. Technologies have been developed aiming to scale up to thousands of processors on both distributed and shared memory systems. Developing parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g., MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. In recent years, new efforts have been made to define new parallel programming paradigms. The best examples are HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, greatly simplifying the tedious tasks encountered in writing message passing programs. HPF is independent of the memory hierarchy; however, due to the immaturity of compiler technology, its performance is still questionable. Although the use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promises portability in a heterogeneous environment and offers the possibility to "compile once and run anywhere." To test these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large-scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage

  6. Nicotine inhibits memory CTL programming.

    Directory of Open Access Journals (Sweden)

    Zhifeng Sun

    Nicotine is the main tobacco component responsible for tobacco addiction and is used extensively in smoking and smoking cessation therapies. However, little is known about its effects on the immune system. We confirmed that multiple nicotinic receptors are expressed on mouse and human cytotoxic T lymphocytes (CTLs) and demonstrated that nicotinic receptors on mouse CTLs are regulated during activation. Acute nicotine presence during activation increases primary CTL expansion in vitro, but impairs in vivo expansion after transfer and subsequent memory CTL differentiation, which reduces protection against subsequent pathogen challenges. Furthermore, nicotine abolishes the regulatory effect of rapamycin on memory CTL programming, which can be attributed to the fact that rapamycin enhances expression of nicotinic receptors. Interestingly, naïve CTLs from chronic nicotine-treated mice have normal memory programming, which is impaired by nicotine during activation in vitro. In conclusion, simultaneous exposure to nicotine and antigen during CTL activation negatively affects memory development.

  7. Nicotine inhibits memory CTL programming.

    Science.gov (United States)

    Sun, Zhifeng; Smyth, Kendra; Garcia, Karla; Mattson, Elliot; Li, Lei; Xiao, Zhengguo

    2013-01-01

    Nicotine is the main tobacco component responsible for tobacco addiction and is used extensively in smoking and smoking cessation therapies. However, little is known about its effects on the immune system. We confirmed that multiple nicotinic receptors are expressed on mouse and human cytotoxic T lymphocytes (CTLs) and demonstrated that nicotinic receptors on mouse CTLs are regulated during activation. Acute nicotine presence during activation increases primary CTL expansion in vitro, but impairs in vivo expansion after transfer and subsequent memory CTL differentiation, which reduces protection against subsequent pathogen challenges. Furthermore, nicotine abolishes the regulatory effect of rapamycin on memory CTL programming, which can be attributed to the fact that rapamycin enhances expression of nicotinic receptors. Interestingly, naïve CTLs from chronic nicotine-treated mice have normal memory programming, which is impaired by nicotine during activation in vitro. In conclusion, simultaneous exposure to nicotine and antigen during CTL activation negatively affects memory development.

  8. HPF: a data parallel programming interface for large-scale numerical simulations

    Energy Technology Data Exchange (ETDEWEB)

    Seo, Yoshiki; Suehiro, Kenji; Murai, Hitoshi [NEC Corp., Tokyo (Japan)]

    1998-03-01

    HPF (High Performance Fortran) is a data parallel language designed for programming on distributed memory parallel systems. The first draft of HPF 1.0 was defined in 1993 as a de facto standard language. Recently, relatively reliable HPF compilers have become available on several distributed memory parallel systems. Many projects to parallelize real-world programs have started, mainly in the U.S. and Europe, and the weak and strong points of the current HPF have become clear. In this paper, the major data transfer patterns required to parallelize numerical simulations, such as SHIFT, matrix transposition, reduction, GATHER/SCATTER, and irregular communication, are described, along with the programming methods to implement them in HPF. The problems in the current HPF interface for developing efficient parallel programs, and recent activities to deal with them, are presented as well. (author)

  9. Parallel Breadth-First Search on Distributed Memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Computational Research Division; Buluc, Aydin; Madduri, Kamesh

    2011-04-15

    Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly-tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse-matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
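
    The level-synchronous strategy is compact enough to sketch. Below is a serial C version on a CSR adjacency (illustrative; in the paper's distributed version each rank owns a slice of the frontier and exchanges newly discovered vertices at every level).

        #include <stdlib.h>

        /* xadj/adj form a standard CSR adjacency; dist[] receives BFS levels. */
        void bfs(int n, const int *xadj, const int *adj, int src, int *dist) {
            int *frontier = malloc(n * sizeof *frontier);
            int *next     = malloc(n * sizeof *next);
            for (int i = 0; i < n; i++) dist[i] = -1;
            dist[src] = 0;
            frontier[0] = src;
            int fsize = 1, level = 0;
            while (fsize > 0) {
                int nsize = 0;
                for (int i = 0; i < fsize; i++) {      /* expand the frontier */
                    int u = frontier[i];
                    for (int e = xadj[u]; e < xadj[u + 1]; e++) {
                        int v = adj[e];
                        if (dist[v] < 0) {             /* first visit */
                            dist[v] = level + 1;
                            next[nsize++] = v;
                        }
                    }
                }
                int *tmp = frontier; frontier = next; next = tmp;
                fsize = nsize;
                level++;
            }
            free(frontier); free(next);
        }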

  10. Parallel calculations on shared memory, NUMA-based computers using MATLAB

    Science.gov (United States)

    Krotkiewski, Marcin; Dabrowski, Marcin

    2014-05-01

    Achieving satisfactory computational performance in numerical simulations on modern computer architectures can be a complex task. Multi-core design makes it necessary to parallelize the code. Efficient parallelization on NUMA (Non-Uniform Memory Access) shared memory architectures necessitates explicit placement of the data in memory close to the CPU that uses it. In addition, using more than 8 CPUs (~100 cores) requires a cluster solution of interconnected nodes, which involves (expensive) communication between the processors. It takes significant effort to overcome these challenges even when programming in low-level languages, which give the programmer full control over data placement and work distribution. Instead, many modelers use high-level tools such as MATLAB, which severely limit the optimization/tuning options available. Nonetheless, the advantage of programming simplicity and a large available code base can tip the scale in favor of MATLAB. We investigate whether MATLAB can be used for efficient, parallel computations on modern shared memory architectures. A common approach to performance optimization of MATLAB programs is to identify a bottleneck and migrate the corresponding code block to a MEX file implemented in, e.g., C. Instead, we aim at achieving scalable parallel performance of MATLAB's core functionality. Some of MATLAB's internal functions (e.g., bsxfun, sort, BLAS3, operations on vectors) are multi-threaded. Achieving high parallel efficiency of those may potentially improve the performance of a significant portion of MATLAB's code base. Since we do not have MATLAB's source code, our performance tuning relies on the tools provided by the operating system alone. Most importantly, we use custom memory allocation routines, thread-to-CPU binding, and memory page migration. The performance tests are carried out on multi-socket shared memory systems (2- and 4-way Intel-based computers), as well as a Distributed Shared Memory machine with 96 CPU
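
    The thread-to-CPU binding the record relies on can be done from outside the application on Linux; a minimal C sketch of the underlying call is below (Linux-specific, and the CPU index is illustrative). On first-touch NUMA systems, binding before data initialization also pins page placement.

        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>

        int main(void) {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(2, &set);              /* illustrative: pin to CPU 2 */
            if (sched_setaffinity(0, sizeof set, &set) != 0) {
                perror("sched_setaffinity");
                return 1;
            }
            printf("now running on CPU %d\n", sched_getcpu());
            return 0;
        }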

  11. A portable implementation of ARPACK for distributed memory parallel architectures

    Energy Technology Data Exchange (ETDEWEB)

    Maschhoff, K.J.; Sorensen, D.C.

    1996-12-31

    ARPACK is a package of Fortran 77 subroutines which implement the Implicitly Restarted Arnoldi Method used for solving large sparse eigenvalue problems. A parallel implementation of ARPACK is presented which is portable across a wide range of distributed memory platforms and requires minimal changes to the serial code. The communication layers used for message passing are the Basic Linear Algebra Communication Subprograms (BLACS) developed for the ScaLAPACK project and the Message Passing Interface (MPI).

  12. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    Science.gov (United States)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described, and performance measurements for the single- and multiprocessor implementations are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of Re_tau = 180 has shown good agreement with existing data.
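
    The hybrid MPI/OpenMP approach proposed here has a standard skeleton: one MPI rank per shared-memory node with OpenMP threads inside it. The sketch below is generic C, not the LESTool code; the per-thread work is illustrative.

        #include <mpi.h>
        #include <omp.h>
        #include <stdio.h>

        /* FUNNELED: only the main thread of each rank makes MPI calls. */
        int main(int argc, char **argv) {
            int provided, rank;
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double local = 0.0;
            #pragma omp parallel reduction(+:local)
            local += omp_get_thread_num() + 1;   /* illustrative work */

            double global;
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0) printf("global sum = %.1f\n", global);
            MPI_Finalize();
            return 0;
        }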

  13. OpenCL parallel programming development cookbook

    CERN Document Server

    Tay, Raymond

    2013-01-01

    OpenCL Parallel Programming Development Cookbook provides a set of advanced recipes that can be utilized to optimize existing code. It is therefore ideal for experienced developers with a working knowledge of C/C++ and OpenCL: software developers who have wondered what to do with that newly bought CPU or GPU other than use it for playing computer games, and who want to learn how to write parallel programs in OpenCL so that life isn't too boring.

  14. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    Energy Technology Data Exchange (ETDEWEB)

    Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operations. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using single-processor based dynamic simulation solutions. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.

  15. Contributions to computational stereology and parallel programming

    DEFF Research Database (Denmark)

    Rasmusson, Allan

    rotator, even without the need for isotropic sections. To meet the need for computational power to perform image restoration of virtual tissue sections, parallel programming on GPUs has also been part of the project. This has led to a significant change in paradigm for a previously developed surgical

  16. Parallel Volunteer Learning during Youth Programs

    Science.gov (United States)

    Lesmeister, Marilyn K.; Green, Jeremy; Derby, Amy; Bothum, Candi

    2012-01-01

    Lack of time is a hindrance for volunteers to participate in educational opportunities, yet volunteer success in an organization is tied to the orientation and education they receive. Meeting diverse educational needs of volunteers can be a challenge for program managers. Scheduling a Volunteer Learning Track for chaperones that is parallel to a…

  17. An informal introduction to program transformation and parallel processors

    Energy Technology Data Exchange (ETDEWEB)

    Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)

    1994-08-01

    In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, since my previous use of computers was as a classroom demonstration tool.

  18. Advanced parallel programming models research and development opportunities.

    Energy Technology Data Exchange (ETDEWEB)

    Wen, Zhaofang.; Brightwell, Ronald Brian

    2004-07-01

    There is currently a large research and development effort within the high-performance computing community on advanced parallel programming models. This research can potentially have an impact on parallel applications, system software, and computing architectures in the next several years. Given Sandia's expertise and unique perspective in these areas, particularly on very large-scale systems, there are many areas in which Sandia can contribute to this effort. This technical report provides a survey of past and present parallel programming model research projects and provides a detailed description of the Partitioned Global Address Space (PGAS) programming model. The PGAS model may offer several improvements over the traditional distributed memory message passing model, which is the dominant model currently being used at Sandia. This technical report discusses these potential benefits and outlines specific areas where Sandia's expertise could contribute to current research activities. In particular, we describe several projects in the areas of high-performance networking, operating systems and parallel runtime systems, compilers, application development, and performance evaluation.

  19. Paging memory from random access memory to backing storage in a parallel computer

    Science.gov (United States)

    Archer, Charles J; Blocksome, Michael A; Inglett, Todd A; Ratterman, Joseph D; Smith, Brian E

    2013-05-21

    Paging memory from random access memory (`RAM`) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node.

  20. Concurrency-based approaches to parallel programming

    Science.gov (United States)

    Kale, L.V.; Chrisochoides, N.; Kohl, J.; Yelick, K.

    1995-01-01

    The inevitable transition to parallel programming can be facilitated by appropriate tools, including languages and libraries. After describing the needs of applications developers, this paper presents three specific approaches aimed at development of efficient and reusable parallel software for irregular and dynamic-structured problems. A salient feature of all three approaches is their exploitation of concurrency within a processor. Benefits of individual approaches such as these can be leveraged by an interoperability environment which permits modules written using different approaches to co-exist in single applications.

  1. Elevated cortisol at retrieval suppresses false memories in parallel with correct memories.

    Science.gov (United States)

    Diekelmann, Susanne; Wilhelm, Ines; Wagner, Ullrich; Born, Jan

    2011-04-01

    Retrieving a memory is a reconstructive process in which encoded representations can be changed and distorted. This process sometimes leads to the generation of "false memories," that is, when people remember events that, in fact, never happened. Such false memories typically represent a kind of "gist" being extracted from single encountered events. The stress hormone cortisol is known to substantially impair memory retrieval. Here, in a double-blind, placebo-controlled crossover design, we tested the effect of an intravenous cortisol infusion before retrieval testing on the occurrence of false memories and on recall of correct memories using a modified Deese-Roediger-McDermott paradigm. Subjects studied sets of abstract shapes, with each set being derived from one prototype that was not presented during learning. At retrieval taking place 9 hr after learning, subjects were presented with studied shapes, nonstudied shapes, and the prototypes, and had to indicate whether or not each shape had been presented at learning. Cortisol administration distinctly reduced susceptibility to false memories (i.e., false recognition of prototypes) and, in parallel, impaired retrieval of correct memories (i.e., correct recognition of studied shapes). Response bias as well as confidence ratings and remember/know/guess judgments were not affected. Our results support gist-based theories of false memory generation, assuming a simultaneous storage of the gist and specific details of an event. Cortisol, by a general impairing influence on retrieval operations, decreases, in parallel, retrieval of false (i.e., gist) and correct (i.e., specific) memories for the event.

  2. Towards HPC++: A Unified Approach to Parallel Programming in C++

    Science.gov (United States)

    1998-10-30

    Compositional C++, or CC++, is a general purpose parallel programming language designed to support a wide range of parallel programming styles. By...appropriate for parallelizing the range of applications that one would write in C++. CC++ supports the integration of different parallel programming styles

  3. Optimal Parallel Algorithm for the Knapsack Problem Without Memory Conflicts

    Institute of Scientific and Technical Information of China (English)

    Ken-Li Li; Ren-Fa Li; Qing-Hua Li

    2004-01-01

    The knapsack problem is well known to be NP-complete. Due to its importance in cryptosystems and in number theory, much effort has been made in the past two decades to find techniques that could lead to practical algorithms with reasonable running time. This paper proposes a new parallel algorithm for the knapsack problem in which the optimal merging algorithm is adopted. The proposed algorithm is based on an EREW-SIMD machine with shared memory. It is proved that the proposed algorithm is both optimal and the first memory-conflict-free algorithm for the knapsack problem. Comparisons of algorithm performance show that it is an improvement over past research.
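
    The paper's optimal-merging EREW algorithm is specialized, but the memory-conflict-free idea can be illustrated with a generic two-row dynamic-programming knapsack (textbook 0/1 DP with OpenMP, not the paper's algorithm): every cur[c] is written by exactly one thread while only prev[] is read, so no two threads touch the same cell.

        #include <stdlib.h>

        long knapsack(int n, const int w[], const long v[], int cap) {
            long *prev = calloc(cap + 1, sizeof *prev);
            long *cur  = calloc(cap + 1, sizeof *cur);
            for (int i = 0; i < n; i++) {
                /* each capacity entry is independent: conflict-free */
                #pragma omp parallel for
                for (int c = 0; c <= cap; c++) {
                    long best = prev[c];                     /* skip item i */
                    if (c >= w[i] && prev[c - w[i]] + v[i] > best)
                        best = prev[c - w[i]] + v[i];        /* take item i */
                    cur[c] = best;
                }
                long *t = prev; prev = cur; cur = t;         /* roll rows */
            }
            long ans = prev[cap];
            free(prev); free(cur);
            return ans;
        }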

  4. Parallel k-means++ for Multiple Shared-Memory Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Mackey, Patrick S.; Lewis, Robert R.

    2016-09-22

    In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
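
    For reference, exact k-means++ seeding is short; the sketch below (1-D points, plain C, not the paper's multi-architecture code) marks the distance recomputation that a parallel implementation distributes across threads each round.

        #include <stdlib.h>
        #include <float.h>

        void kmeanspp_seed(int n, const double *x, int k, double *centers) {
            double *d2 = malloc(n * sizeof *d2);
            centers[0] = x[rand() % n];          /* first center: uniform */
            for (int c = 1; c < k; c++) {
                double total = 0.0;
                for (int i = 0; i < n; i++) {    /* the parallelizable loop */
                    double best = DBL_MAX;
                    for (int j = 0; j < c; j++) {
                        double d = x[i] - centers[j];
                        if (d * d < best) best = d * d;
                    }
                    d2[i] = best;
                    total += best;
                }
                /* sample the next center proportional to squared distance */
                double r = total * ((double)rand() / RAND_MAX), acc = 0.0;
                int pick = n - 1;
                for (int i = 0; i < n; i++) {
                    acc += d2[i];
                    if (acc >= r) { pick = i; break; }
                }
                centers[c] = x[pick];
            }
            free(d2);
        }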

  5. Molecular dynamics simulation on a network of workstations using a machine-independent parallel programming language.

    OpenAIRE

    1991-01-01

    Molecular dynamics simulations investigate local and global motion in molecules. Several parallel computing approaches have been taken to attack the most computationally expensive phase of molecular simulations, the evaluation of long range interactions. This paper develops a straightforward but effective algorithm for molecular dynamics simulations using the machine-independent parallel programming language, Linda. The algorithm was run both on a shared memory parallel computer and on a netw...

  6. Automatic array alignment in data-parallel programs

    Science.gov (United States)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    FORTRAN 90 and other data-parallel languages express parallelism in the form of operations on data aggregates such as arrays. Misalignment of the operands of an array operation can reduce program performance on a distributed-memory parallel machine by requiring nonlocal data accesses. Determining array alignments that reduce communication is therefore a key issue in compiling such languages. We present a framework for the automatic determination of array alignments in array-based, data-parallel languages. Our language model handles array sectioning, reductions, spreads, transpositions, and masked operations. We decompose alignment functions into three constituents: axis, stride, and offset. For each of these subproblems, we show how to solve the alignment problem for a basic block of code, possibly containing common subexpressions. Alignments are generated for all array objects in the code, both named program variables and intermediate results. We assign computation to processors by virtue of explicit alignment of all temporaries; the resulting work assignment is in general better than that provided by the 'owner-computes' rule. Finally, we present some ideas for dealing with control flow, replication, and dynamic alignments that depend on loop induction variables.

  7. Specifying and Executing Optimizations for Parallel Programs

    Directory of Open Access Journals (Sweden)

    William Mansky

    2014-07-01

    Compiler optimizations, usually expressed as rewrites on program graphs, are a core part of all modern compilers. However, even production compilers have bugs, and these bugs are difficult to detect and resolve. The problem only becomes more complex when compiling parallel programs; from the choice of graph representation to the possibility of race conditions, optimization designers have a range of factors to consider that do not appear when dealing with single-threaded programs. In this paper we present PTRANS, a domain-specific language for formal specification of compiler transformations, and describe its executable semantics. The fundamental approach of PTRANS is to describe program transformations as rewrites on control flow graphs with temporal logic side conditions. The syntax of PTRANS allows cleaner, more comprehensible specification of program optimizations; its executable semantics allows these specifications to act as prototypes for the optimizations themselves, so that candidate optimizations can be tested and refined before going on to include them in a compiler. We demonstrate the use of PTRANS to state, test, and refine the specification of a redundant store elimination optimization on parallel programs.

  8. Timing-Sequence Testing of Parallel Programs

    Institute of Scientific and Technical Information of China (English)

    LIANG Yu; LI Shu; ZHANG Hui; HAN Chengde

    2000-01-01

    Testing of parallel programs involves two parts: testing of control flow within the processes and testing of timing sequences. This paper focuses on the latter, particularly on the timing sequences of message-passing paradigms. First, a coarse-grained SYN-sequence model is built up to describe the execution of distributed programs; all of the topics discussed in this paper are based on it. The most direct way to test a program is to run it. A fault-free parallel program should produce both correct computing results and a proper SYN-sequence. In order to analyze the validity of an observed SYN-sequence, this paper presents a formal specification (Backus Normal Form) of the valid SYN-sequence. To date there has been little work on testing coverage for distributed programs. Calculating the number of valid SYN-sequences is the key to the coverage problem, but this number is extremely large and it is very hard to obtain the combination law among SYN-events. To resolve this problem, this paper proposes an efficient testing strategy, atomic SYN-event testing, which first linearizes the SYN-sequence (making it consist only of serial atomic SYN-events) and then tests each atomic SYN-event independently. The paper provides the calculating formula for the number of valid SYN-sequences of a tree-topology atomic SYN-event (broadcast and combine). Furthermore, the number of valid SYN-sequences also, to some degree, mirrors the testability of parallel programs. Taking the tree-topology atomic SYN-event as an example, this paper examines the testability and communication speed of the tree-topology atomic SYN-event under different numbers of branches, in order to achieve a more satisfactory tradeoff between testability and communication efficiency.

  9. Shared and Distributed Memory Parallel Security Analysis of Large-Scale Source Code and Binary Applications

    Energy Technology Data Exchange (ETDEWEB)

    Quinlan, D; Barany, G; Panas, T

    2007-08-30

    Many forms of security analysis of large-scale applications can be substantially automated, but the size and complexity of the applications can exceed the time and memory available on conventional desktop computers. Most commercial tools are understandably focused on such conventional desktop resources. This paper presents research work on the parallelization of security analysis of both source code and binaries within our Compass tool, which is implemented using the ROSE source-to-source open compiler infrastructure. We have focused on both shared and distributed memory parallelization of the evaluation of rules implemented as checkers for a wide range of secure programming rules, applicable to desktop machines, networks of workstations, and dedicated clusters. While Compass as a tool focuses on source code analysis and reports violations of an extensible set of rules, the binary analysis work uses the exact same infrastructure but is less well developed into an equivalent final tool.

  10. The parallel processing of EGS4 code on distributed memory scalar parallel computer: Intel Paragon XP/S15-256

    Energy Technology Data Exchange (ETDEWEB)

    Takemiya, Hiroshi; Ohta, Hirofumi; Honma, Ichirou

    1996-03-01

    The parallelization of the Electro-Magnetic Cascade Monte Carlo Simulation Code EGS4 on the distributed memory scalar parallel computer Intel Paragon XP/S15-256 is described. EGS4 has the feature that the calculation time differs greatly from one incident particle to another because of the dynamic generation of secondary particles and the different behavior of each particle. Granularity for parallel processing, the parallel programming model, and the algorithm for parallel random number generation are discussed, and two methods, which allocate particles dynamically or statically, are used to realize high-speed parallel processing of this code. Among the four problems chosen for performance evaluation, the speedup factors for three problems reached nearly 100 with 128 processors. It was found that when both the calculation time per incident particle and its dispersion are large, the dynamic particle allocation method, which averages the load across processors, is preferable; when they are small, the static particle allocation method, which reduces the communication overhead, is preferable. Moreover, it is pointed out that to obtain accurate results, it is necessary to use double precision variables in the EGS4 code. Finally, the workflow of program parallelization is analyzed, and tools for program parallelization are discussed in light of the experience gained from the EGS4 parallelization. (author).
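
    The static-versus-dynamic allocation tradeoff described above maps directly onto OpenMP loop schedules, sketched below (illustrative; the original work used message passing on the Paragon, not OpenMP): dynamic scheduling balances widely varying per-particle costs at the price of scheduling overhead, while static scheduling minimizes overhead when costs are uniform.

        #include <omp.h>
        #include <stdio.h>
        #include <unistd.h>

        #define PARTICLES 64

        int main(void) {
            /* switch to schedule(static) when per-particle cost is uniform */
            #pragma omp parallel for schedule(dynamic, 1)
            for (int p = 0; p < PARTICLES; p++) {
                usleep(1000 * (p % 7));   /* stand-in for tracking a particle
                                             whose cost varies with index */
            }
            printf("done\n");
            return 0;
        }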

  11. Parallel matrix transpose algorithms on distributed memory concurrent computers

    Energy Technology Data Exchange (ETDEWEB)

    Choi, J.; Walker, D.W. [Oak Ridge National Lab., TN (United States); Dongarra, J.J. [Oak Ridge National Lab., TN (United States)]|[Univ. of Tennessee, Knoxville, TN (United States). Dept. of Computer Science

    1993-10-01

    This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. It is assumed that the matrix is distributed over a P x Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD) of P and Q. If P and Q are relatively prime, the matrix transpose algorithm involves complete exchange communication. If P and Q are not relatively prime, processors are divided into GCD groups and the communication operations are overlapped for different groups of processors. Processors transpose GCD wrapped diagonal blocks simultaneously, and the matrix can be transposed with LCM/GCD steps, where LCM is the least common multiple of P and Q. The algorithms make use of non-blocking, point-to-point communication between processors. The use of nonblocking communication allows a processor to overlap the messages that it sends to different processors, thereby avoiding unnecessary synchronization. Combined with the matrix multiplication routine, C = A·B, the algorithms are used to compute parallel multiplications of transposed matrices, C = A^T·B^T, in the PUMMA package. Details of the parallel implementation of the algorithms are given, and results are presented for runs on the Intel Touchstone Delta computer.

  12. NavP: Structured and Multithreaded Distributed Parallel Programming

    Science.gov (United States)

    Pan, Lei

    2007-01-01

    We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) distribute the data using the data communication pattern in a given algorithm; (2) insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) cut the sequential migrating thread and construct a mobile pipeline; and (4) loop back for refinement. NavP is significantly different from the currently prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and does not change the code structure of the original algorithm, in sharp contrast to MP, whose implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance, whereas approaches such as DSM or HPF, though relatively easy to use compared to MP, have so far failed to deliver satisfying performance; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine-grained (multithreading on shared memory) and coarse-grained (pipelined tasks on distributed memory) parallelism, in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.

  13. Parallelization and checkpointing of GPU applications through program transformation

    Energy Technology Data Exchange (ETDEWEB)

    Solano-Quinde, Lizandro Damian [Iowa State Univ., Ames, IA (United States)

    2012-01-01

    GPUs have emerged as a powerful tool for accelerating general-purpose applications. The availability of programming languages that make writing general-purpose applications for GPUs tractable has consolidated GPUs as an alternative for accelerating general-purpose applications. Among the areas that have benefited from GPU acceleration are: signal and image processing, computational fluid dynamics, quantum chemistry, and, in general, the high-performance computing (HPC) industry. In order to exploit still higher levels of parallelism with GPUs, multi-GPU systems are gaining popularity. In this context, single-GPU applications are parallelized for running on multi-GPU systems. Furthermore, multi-GPU systems help to overcome the GPU memory limitation for applications with a large memory footprint. Parallelizing single-GPU applications has been approached with libraries that distribute the workload at runtime; however, these impose execution overhead and are not portable. On traditional CPU systems, by contrast, parallelization has been approached through application transformation at pre-compile time, which enhances the application to distribute the workload at the application level and avoids the issues of library-based approaches. Hence, a parallelization scheme for GPU systems based on application transformation is needed. Like any computing engine today, GPUs also raise reliability concerns: they are vulnerable to transient and permanent failures, and current checkpoint/restart techniques are not suitable for systems with GPUs. Checkpointing for GPU systems presents new and interesting challenges, primarily due to the natural differences imposed by the hardware design, the memory subsystem architecture, the massive number of threads, and the limited amount of synchronization among threads. Therefore, a checkpoint/restart technique suitable for GPU systems is needed. The goal of this work is to exploit higher levels of parallelism in multi-GPU systems and to provide checkpoint/restart support for GPU applications, both through program transformation.

  14. Assessing Programming Costs of Explicit Memory Localization on a Large Scale Shared Memory Multiprocessor

    Directory of Open Access Journals (Sweden)

    Silvio Picano

    1992-01-01

    We present detailed experimental work involving a commercially available large-scale shared-memory multiple instruction stream-multiple data stream (MIMD) parallel computer having a software-controlled cache coherence mechanism. To make effective use of such an architecture, the programmer is responsible for designing the program's structure to match the underlying multiprocessor's capabilities. We describe the techniques used to exploit our multiprocessor (the BBN TC2000) on a network simulation program, showing the resulting performance gains and the associated programming costs. We show that an efficient implementation relies heavily on the user's ability to explicitly manage the memory system.

  15. Battling memory requirements of array programming through streaming

    DEFF Research Database (Denmark)

    Kristensen, Mads Ruben Burgdorff; Avery, James Emil; Blum, Troels;

    2016-01-01

    A barrier to efficient array programming, for example in Python/NumPy, is that algorithms written as pure array operations, completely without loops, are most efficient on small input but can lead to explosions in memory use. The present paper presents a solution to this problem using array streaming, implemented in the automatic parallelization high-performance framework Bohrium. This makes it possible to use array programming in Python/NumPy code directly, even when the apparent memory requirement exceeds the machine capacity, since the automatic streaming eliminates the temporary memory overhead, yielding corresponding improvements in speed and utilization of GPGPU cores. The streaming-enabled Bohrium effortlessly runs programs on input sizes far beyond those that crash in pure NumPy due to exhausted system memory.
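
    The memory explosion the abstract refers to, and the effect of streaming, can be sketched in plain NumPy (a hand-rolled illustration; Bohrium performs the equivalent chunking automatically):

      import numpy as np

      def naive(n):
          # Pure array code: several n-element temporaries exist at once,
          # so peak memory is a multiple of the input size.
          x = np.arange(n, dtype=np.float64)
          return np.sum(np.sin(x) * np.cos(x) + x / (x + 1.0))

      def streamed(n, chunk=1_000_000):
          # Same computation evaluated chunk by chunk: temporaries never
          # exceed the chunk size, so huge n no longer exhausts memory.
          total = 0.0
          for start in range(0, n, chunk):
              x = np.arange(start, min(start + chunk, n), dtype=np.float64)
              total += np.sum(np.sin(x) * np.cos(x) + x / (x + 1.0))
          return total

      print(naive(10_000_000), streamed(10_000_000))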

  16. Profiling parallel Mercury programs with ThreadScope

    CERN Document Server

    Bone, Paul

    2011-01-01

    The behavior of parallel programs is even harder to understand than the behavior of sequential programs. Parallel programs may suffer from any of the performance problems affecting sequential programs, as well as from several problems unique to parallel systems. Many of these problems are quite hard (or even practically impossible) to diagnose without help from specialized tools. We present a proposal for a tool for profiling the parallel execution of Mercury programs, a proposal whose implementation we have already started. This tool is an adaptation and extension of the ThreadScope profiler that was first built to help programmers visualize the execution of parallel Haskell programs.

  17. Distributed Memory Programming on Many-Cores

    DEFF Research Database (Denmark)

    Berthold, Jost; Dieterle, Mischa; Lobachev, Oleg;

    2009-01-01

    Eden is a parallel extension of the lazy functional language Haskell providing dynamic process creation and automatic data exchange. As a Haskell extension, Eden takes a high-level approach to parallel programming and thereby simplifies parallel program development. The current implementation is ...

  18. Programming in Manticore, a Heterogeneous Parallel Functional Language

    Science.gov (United States)

    Fluet, Matthew; Bergstrom, Lars; Ford, Nic; Rainey, Mike; Reppy, John; Shaw, Adam; Xiao, Yingqi

    The Manticore project is an effort to design and implement a new functional language for parallel programming. Unlike many earlier parallel languages, Manticore is a heterogeneous language that supports parallelism at multiple levels. Specifically, the Manticore language combines Concurrent ML-style explicit concurrency with fine-grain, implicitly threaded, parallel constructs. These lectures will introduce the Manticore language and explore a variety of programs written to take advantage of heterogeneous parallelism.

  19. Parallel Programming Strategies for Irregular Adaptive Applications

    Science.gov (United States)

    Biswas, Rupak; Biegel, Bryan (Technical Monitor)

    2001-01-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance for such computations. In this work, we examine two typical irregular adaptive applications, Dynamic Remeshing and N-Body, under competing programming methodologies and across various parallel architectures. The Dynamic Remeshing application simulates flow over an airfoil, and refines localized regions of the underlying unstructured mesh. The N-Body experiment models two neighboring Plummer galaxies that are about to undergo a merger. Both problems demonstrate dramatic changes in processor workloads and interprocessor communication with time; thus, dynamic load balancing is a required component.

  20. Updating verbal and visuospatial working memory: Are the processes parallel?

    Institute of Scientific and Technical Information of China (English)

    YUE ZhenZhu; ZHANG Ming; ZHOU XiaoLin

    2008-01-01

    The current study compared the processes of updating verbal and visuospatial working memory (WM) and examined the roles of the central executive and slave systems in working memory updating tasks, by changing the number of items updated simultaneously so as to manipulate the load on the central executive. Experiment 1 used the verbal WM updating task, and the results validated the efficiency of the paradigm in manipulating the load on the central executive. Experiment 2 employed the verbal WM updating task with an articulatory suppression task to interfere with the phonological loop. The results supported the study by Morris and Jones, revealing that the central executive system played an important role in the updating component of verbal WM, while the phonological loop was responsible for the serial recall component. Experiment 3 employed the visuospatial WM updating task with a spatial tapping task to interfere with the visuospatial sketchpad. The results suggested that the visuospatial sketchpad and the central executive together dealt with the updating component, while the visuospatial sketchpad was responsible for the serial recall component by itself. These results are consistent with the findings that the visuospatial sketchpad has close links with the central executive, while the phonological loop is separate from the central executive. It suggests that updating visuospatial and verbal WM are not two parallel processes.

  1. VERIFICATION OF PARALLEL AUTOMATA-BASED PROGRAMS

    Directory of Open Access Journals (Sweden)

    M. A. Lukin

    2014-01-01

    The paper deals with an interactive method of automatic verification for parallel automata-based programs. The hierarchical state machines can be implemented in different threads and can interact with each other. Verification is done by means of the Spin tool and includes automatic Promela model construction, conversion of LTL formulas to Spin format, and presentation of counterexamples in terms of automata. Interactive verification makes it possible to decrease verification time and increase the maximum size of verifiable programs. The considered method supports verification of parallel systems of hierarchical automata that interact with each other through messages and shared variables. A feature of the automaton model is that each state machine is considered a new data type and can have an arbitrary bounded number of instances. Each state machine in the system can run another state machine in a new thread or have a nested state machine. This method was implemented in the developed Stater tool. Stater shows correct operation for all test cases.

  2. Parallel functional programming in Sisal: Fictions, facts, and future

    Energy Technology Data Exchange (ETDEWEB)

    McGraw, J.R.

    1993-07-01

    This paper provides a status report on the progress of research and development on the functional language Sisal. This project focuses on providing a highly effective method of writing large scientific applications that can efficiently execute on a spectrum of different multiprocessors. The paper includes sections on the language definition, compilation strategies, and programming techniques intended for readers with little or no background with Sisal. The section on performance presents our most recent results on execution speed for shared-memory multiprocessors, our findings using Sisal to develop codes, and our experiences migrating the same source code to different machines. For large programs, the execution performance of Sisal (with minimal supporting advice from the programmer) usually exceeds that of the best available automatic, vector/parallel Fortran compilers. Our evidence also indicates that Sisal programs tend to be shorter in length, faster to write, and clearer to understand than equivalent algorithms in Fortran. The paper concludes with a substantial discussion of common criticisms of the language and our plans for addressing them. Most notably, efficient implementations for distributed memory machines are lacking, an issue we plan to remedy.

  3. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    Science.gov (United States)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  4. Programming distributed memory architectures using Kali

    Science.gov (United States)

    Mehrotra, Piyush; Vanrosendale, John

    1990-01-01

    Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment is presented, Kali, which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system of annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language are described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.

  5. IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Seyong [ORNL; Vetter, Jeffrey S [ORNL

    2016-01-01

    We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications to the target heterogeneous accelerator clusters to fully exploit system-specific features. IMPACC provides programmers with a unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance on three heterogeneous accelerator systems, including the Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.

  6. Parallelizing Sylvester-like operations on a distributed memory computer

    Energy Technology Data Exchange (ETDEWEB)

    Hu, D.Y.; Sorensen, D.C. [Rice Univ., Houston, TX (United States)

    1994-12-31

    Discretization of linear operators arising in applied mathematics often leads to matrices with the following structure: M(x) = (D ⊗ A + B ⊗ I_n + V)x, where x ∈ R^{mn}, B, D ∈ R^{n×n}, A ∈ R^{m×m}, and V ∈ R^{mn×mn}; both D and V are diagonal. For notational convenience, the authors assume that both A and B are symmetric; all the results in this paper extend easily to general A and B. The linear operator on R^{mn} defined above can be viewed as a generalization of the Sylvester operator S(x) = (I_m ⊗ A + B ⊗ I_n)x, and the authors therefore refer to it as a Sylvester-like operator; the schemes discussed in this paper thus also apply to the Sylvester operator. The authors present a SIMD scheme for parallelization of the Sylvester-like operator on a distributed memory computer. This scheme is designed to approach the best possible efficiency by avoiding unnecessary communication among processors.
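
    To see why such operators parallelize well: the Kronecker structure lets M be applied without ever forming the mn × mn matrix. The sketch below is my own illustration using the column-stacking identity (D ⊗ A)vec(X) = vec(A X D^T), with the identity factor sized as I_m so that the dimensions work out; the paper's exact conventions may differ.

      import numpy as np

      rng = np.random.default_rng(0)
      m, n = 4, 3
      A = rng.standard_normal((m, m))
      B = rng.standard_normal((n, n))
      D = np.diag(rng.standard_normal(n))     # D is diagonal
      v = rng.standard_normal(m * n)          # diagonal of V

      X = rng.standard_normal((m, n))
      x = X.ravel(order="F")                  # column-stacking vec(X)

      # Dense application: forms the full mn x mn operator (checking only).
      M = np.kron(D, A) + np.kron(B, np.eye(m)) + np.diag(v)
      y_dense = M @ x

      # Structure-exploiting application: never forms M. Each term reduces
      # to small dense products whose columns can be processed independently,
      # which is what makes a SIMD/distributed implementation natural.
      Y = A @ X @ D.T + X @ B.T + v.reshape((m, n), order="F") * X
      y_fast = Y.ravel(order="F")

      print(np.allclose(y_dense, y_fast))     # True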

  7. Automatic Parallelization and Optimization of Programs by Proof Rewriting

    NARCIS (Netherlands)

    Hurlin, C.; Palsberg, J.; Su, Z.

    2009-01-01

    We show how, given a program and its separation logic proof, one can parallelize and optimize this program and transform its proof simultaneously to obtain a proven parallelized and optimized program. To achieve this goal, we present new proof rules for generating proof trees and a rewrite system on

  8. What a Parallel Programming Language Has to Let You Say,

    Science.gov (United States)

    1984-09-01

    AD-A147 854: What a Parallel Programming Language Has to Let You Say, Alan Bawden and Philip E. Agre, Massachusetts Institute of Technology Artificial Intelligence Laboratory, AI Memo 796, September 1984. (Only the OCR of the report's cover sheet is available; no abstract survives.)

  9. Using CLIPS in the domain of knowledge-based massively parallel programming

    Science.gov (United States)

    Dvorak, Jiri J.

    1994-01-01

    The Program Development Environment (PDE) is a tool for massively parallel programming of distributed-memory architectures. Adopting a knowledge-based approach, the PDE eliminates the complexity introduced by parallel hardware with distributed memory and offers complete transparency with respect to parallelism exploitation. The knowledge-based part of the PDE is realized in CLIPS. Its principal task is to find an efficient parallel realization of the application specified by the user in a comfortable, abstract, domain-oriented formalism. A large collection of fine-grain parallel algorithmic skeletons, represented as COOL objects in a tree hierarchy, contains the algorithmic knowledge. A hybrid knowledge base with rule modules and procedural parts, encoding expertise about the application domain, parallel programming, software engineering, and parallel hardware, enables a high degree of automation in the software development process. In this paper, important aspects of the implementation of the PDE using CLIPS and COOL are shown, including the embedding of CLIPS with the C++-based parts of the PDE. The appropriateness of the chosen approach and of the CLIPS language for knowledge-based software engineering is discussed.

  10. Wnt signaling inhibits CTL memory programming.

    Science.gov (United States)

    Xiao, Zhengguo; Sun, Zhifeng; Smyth, Kendra; Li, Lei

    2013-12-01

    Induction of functional CTLs is one of the major goals of vaccine development and cancer therapy. Inflammatory cytokines are critical for memory CTL generation. Wnt signaling is important for CTL priming and memory formation, but its role in cytokine-driven memory CTL programming is unclear. We found that Wnt signaling inhibited IL-12-driven CTL activation and memory programming. This impaired memory CTL programming was attributed to up-regulation of eomes and down-regulation of T-bet. Wnt signaling suppressed the mTOR pathway during CTL activation, which is different from its effects on other cell types. Interestingly, the impaired memory CTL programming by Wnt was partially rescued by the mTOR inhibitor rapamycin. In conclusion, we found that crosstalk between Wnt and IL-12 signaling inhibits the T-bet and mTOR pathways and impairs memory programming, which can be recovered in part by rapamycin. In addition, direct inhibition of Wnt signaling during CTL activation does not affect CTL memory programming. Therefore, Wnt signaling may serve as a new tool for CTL manipulation in autoimmune diseases and in immune therapy for certain cancers.

  11. Parallel Programming with Matrix Distributed Processing

    CERN Document Server

    Di Pierro, Massimo

    2005-01-01

    Matrix Distributed Processing (MDP) is a C++ library for fast development of efficient parallel algorithms. It constitutes the core of FermiQCD. MDP enables programmers to focus on algorithms, while parallelization is dealt with automatically and transparently. Here we present a brief overview of MDP and examples of applications in computer science (cellular automata), engineering (PDE solvers), and physics (the Ising model).

  12. An interactive parallel programming environment applied in atmospheric science

    Science.gov (United States)

    vonLaszewski, G.

    1996-01-01

    This article introduces an interactive parallel programming environment (IPPE) that simplifies the generation and execution of parallel programs. One of the tasks of the environment is to generate message-passing parallel programs for homogeneous and heterogeneous computing platforms. The parallel programs are represented by using visual objects. This is accomplished with the help of a graphical programming editor that is implemented in Java and enables portability to a wide variety of computer platforms. In contrast to other graphical programming systems, reusable parts of the programs can be stored in a program library to support rapid prototyping. In addition, runtime performance data on different computing platforms is collected in a database. A selection process determines dynamically the software and the hardware platform to be used to solve the problem in minimal wall-clock time. The environment is currently being tested on a Grand Challenge problem, the NASA four-dimensional data assimilation system.

  13. A general purpose subroutine for fast fourier transform on a distributed memory parallel machine

    Science.gov (United States)

    Dubey, A.; Zubair, M.; Grosch, C. E.

    1992-01-01

    One issue which is central in developing a general purpose Fast Fourier Transform (FFT) subroutine on a distributed memory parallel machine is the data distribution. It is possible that different users would like to use the FFT routine with different data distributions. Thus, there is a need to design FFT schemes on distributed memory parallel machines which can support a variety of data distributions. An FFT implementation on a distributed memory parallel machine which works for a number of data distributions commonly encountered in scientific applications is presented. The problem of rearranging the data after computing the FFT is also addressed. The performance of the implementation on a distributed memory parallel machine Intel iPSC/860 is evaluated.
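
    As an illustration of the data-distribution issue (a sketch of mine, not the paper's scheme), here are the two distributions most commonly encountered, block and cyclic, expressed as index-to-processor maps; an FFT routine that supports both must either reorder its butterfly stages or redistribute the data:

      def block_owner(i, n, p):
          # Processor owning element i when n elements are split into p
          # contiguous blocks of ceil(n / p) elements.
          return i // -(-n // p)

      def cyclic_owner(i, n, p):
          # Processor owning element i when elements are dealt round-robin.
          return i % p

      n, p = 16, 4
      print([block_owner(i, n, p) for i in range(n)])   # 0000 1111 2222 3333
      print([cyclic_owner(i, n, p) for i in range(n)])  # 0123 0123 0123 0123

      # A radix-2 FFT butterfly at stage s pairs index i with i XOR 2**s:
      # early stages pair elements inside a block, late stages pair across
      # blocks, so the preferred distribution changes as the transform runs.
      s = 3
      pairs = [(i, i ^ (1 << s)) for i in range(n) if i < i ^ (1 << s)]
      crossing = sum(block_owner(a, n, p) != block_owner(b, n, p)
                     for a, b in pairs)
      print(crossing, "of", len(pairs), "stage-3 butterflies cross blocks")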

  14. A Parallel Vector Machine for the PM Programming Language

    Science.gov (United States)

    Bellerby, Tim

    2016-04-01

    PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using
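
    The masked-vector treatment of conditionals that this abstract describes can be mimicked directly in NumPy (an illustrative sketch, not PM's implementation): each branch of an if-then-else executes for all lanes under a Boolean mask, and a location list replaces the mask once active lanes become sparse:

      import numpy as np

      x = np.arange(-4, 4, dtype=np.float64)

      # if x >= 0 then sqrt(x) else -x, as masked vector operations:
      mask = x >= 0
      out = np.empty_like(x)
      out[mask] = np.sqrt(x[mask])      # "then" branch under the mask
      out[~mask] = -x[~mask]            # "else" branch under the complement

      # Location-list form: when few lanes are active, an index array is a
      # cheaper mask representation than a full Boolean field.
      active = np.flatnonzero(mask)
      out2 = -x
      out2[active] = np.sqrt(x[active])

      print(np.allclose(out, out2))     # True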

  15. Parallel programming practical aspects, models and current limitations

    CERN Document Server

    Tarkov, Mikhail S

    2014-01-01

    Parallel programming is designed for the use of parallel computer systems to solve time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: (1) processing of large data arrays, including image and signal processing in real time; and (2) simulation of complex physical processes and chemical reactions. For each of these classes, prospective solution methods are designed: for data processing, one of the most promising technologies is the use of artificial neural networks, while the particle-in-cell method and cellular automata are very useful for simulation. Problems of the scalability of parallel algorithms and of the transfer of existing parallel programs to future parallel computers are now very acute. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure processing reliability by the relevant organization ...

  16. Method for programming a flash memory

    Science.gov (United States)

    Brosky, Alexander R.; Locke, William N.; Maher, Conrado M.

    2016-08-23

    A method of programming a flash memory is described. The method includes partitioning a flash memory into a first group having a first level of write-protection, a second group having a second level of write-protection, and a third group having a third level of write-protection. The write-protection of the second and third groups is disabled using an installation adapter. The third group is programmed using a Software Installation Device.

  17. Molecular programming of B cell memory.

    Science.gov (United States)

    McHeyzer-Williams, Michael; Okitsu, Shinji; Wang, Nathaniel; McHeyzer-Williams, Louise

    2011-12-09

    The development of high-affinity B cell memory is regulated through three separable phases, each involving antigen recognition by specific B cells and cognate T helper cells. Initially, antigen-primed B cells require cognate T cell help to gain entry into the germinal centre pathway to memory. Once in the germinal centre, B cells with variant B cell receptors must access antigens and present them to germinal centre T helper cells to enter long-lived memory B cell compartments. Following antigen recall, memory B cells require T cell help to proliferate and differentiate into plasma cells. A recent surge of information - resulting from dynamic B cell imaging in vivo and the elucidation of T follicular helper cell programmes - has reshaped the conceptual landscape surrounding the generation of memory B cells. In this Review, we integrate this new information about each phase of antigen-specific B cell development to describe the newly unravelled molecular dynamics of memory B cell programming.

  18. Professional Parallel Programming with C# Master Parallel Extensions with NET 4

    CERN Document Server

    Hillar, Gastón

    2010-01-01

    Expert guidance for those programming today's dual-core processor PCs. As PC processors grow from one or two to as many as eight cores, there is an urgent need for programmers to master concurrent programming. This book dives deep into the latest technologies available to programmers for creating professional parallel applications using C#, .NET 4, and Visual Studio 2010. The book covers task-based programming, coordination data structures, PLINQ, thread pools, the asynchronous programming model, and more. It also teaches other parallel programming techniques, such as SIMD and vectorization.

  19. Selecting Simulation Models when Predicting Parallel Program Behaviour

    OpenAIRE

    Broberg, Magnus; Lundberg, Lars; Grahn, Håkan

    2002-01-01

    The use of multiprocessors is an important way to increase the performance of a supercomputing program. This means that the program has to be parallelized to make use of the multiple processors. The parallelization is unfortunately not an easy task. Development tools supporting parallel programs are important. Further, it is the customer that decides the number of processors in the target machine, and as a result the developer has to make sure that the program runs efficiently on any number ...

  20. Parallelism in a Main-Memory System: The Performance of PRISMA/DB

    NARCIS (Netherlands)

    Wilschut, A.N.; Flokstra, Jan; Apers, Peter M.G.

    1992-01-01

    This paper evaluates the performance of the parallel, main-memory DBMS PRISMA/DB. First, an abstract architecture for parallel query execution is presented. A performance model for the execution of simple relational operations on this architecture is developed. The parameters in the model are set using ...

  1. A Performance Analysis Tool for PVM Parallel Programs

    Institute of Scientific and Technical Information of China (English)

    Chen Wang; Yin Liu; Changjun Jiang; Zhaoqing Zhang

    2004-01-01

    In this paper, we introduce the design and implementation of ParaVT, a visual performance analysis and parallel debugging tool. In ParaVT, we propose an automated instrumentation mechanism. Based on this mechanism, ParaVT automatically analyzes the performance bottlenecks of parallel applications and provides a visual user interface to monitor and analyze the performance of parallel programs. In addition, it supports certain extensions.

  2. Performance Evaluation of Parallel Message Passing and Thread Programming Model on Multicore Architectures

    CERN Document Server

    Hasta, D T

    2010-01-01

    The current trend of multicore architectures on shared memory systems underscores the need for parallelism. While there are several programming models for expressing parallelism, the thread programming model has become a standard for these systems, e.g., OpenMP and POSIX threads. MPI (Message Passing Interface), which remains the dominant model used in high-performance computing today, faces this challenge. The earlier MPI-1 standard has no shared memory concept, and the current MPI-2 has only limited support for shared memory systems. In this research, MPI-2 is compared with OpenMP to see how well MPI performs on multicore/SMP (symmetric multiprocessor) machines. The comparison between OpenMP as a thread programming model and MPI as a message passing programming model is conducted on multicore shared memory machine architectures to see which has better performance in terms of speed and throughput. The application used to assess the scalability of the evaluated parall...
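
    To make the contrast between the two models concrete, here is the same reduction written both ways (an illustrative sketch assuming the mpi4py package is installed; this is not code from the study). The message-passing version gives every rank a private slice and combines results with explicit communication:

      # Run with e.g. `mpiexec -n 4 python sum_mpi.py`
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()
      n = 10_000_000
      lo, hi = rank * n // size, (rank + 1) * n // size
      local = sum(range(lo, hi))                 # each rank owns one slice
      total = comm.allreduce(local, op=MPI.SUM)  # explicit communication
      if rank == 0:
          print(total)

    The shared-memory thread version needs no communication calls at all, since all workers read the same address space (with CPython's GIL this illustrates the programming model rather than real speedup for pure-Python work):

      from concurrent.futures import ThreadPoolExecutor

      def partial_sum(bounds):
          lo, hi = bounds
          return sum(range(lo, hi))

      n, size = 10_000_000, 4
      slices = [(r * n // size, (r + 1) * n // size) for r in range(size)]
      with ThreadPoolExecutor(size) as pool:
          print(sum(pool.map(partial_sum, slices)))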

  3. Architectural Adaptability in Parallel Programming via Control Abstraction

    Science.gov (United States)

    1991-01-01

    Technical Report 359, January 1991. Parallel programming involves finding the potential parallelism in an application, choosing an ... (abstract truncated in the source; the remainder of the record is OCR residue from the report's reference list)

  4. Memory, reasoning and categorization: Parallels and common mechanisms

    Directory of Open Access Journals (Sweden)

    Brett Hayes

    2014-06-01

    Traditionally, memory, reasoning and categorization have been treated as separate components of human cognition. We challenge this distinction, arguing that there is broad scope for crossover between the methods and theories developed for each task. The links between memory and reasoning are illustrated in a review of two lines of research. The first takes theoretical ideas (two-process accounts) and methodological tools (signal detection analysis, receiver operating characteristic curves) from memory research and applies them to important issues in reasoning research: relations between induction and deduction, and the belief bias effect. The second line of research introduces a task in which subjects make either memory or reasoning judgments for the same set of stimuli. Other than broader generalization for reasoning than memory, the results were similar for the two tasks, across a variety of experimental stimuli and manipulations. It was possible to simultaneously explain performance on both tasks within a single cognitive architecture, based on exemplar-based comparisons of similarity. The final sections explore evidence for empirical and processing links between inductive reasoning and categorization and between categorization and recognition. An important implication is that progress in all three of these fields will be expedited by further investigation of the many commonalities between these tasks.

  5. Memory, reasoning, and categorization: parallels and common mechanisms.

    Science.gov (United States)

    Hayes, Brett K; Heit, Evan; Rotello, Caren M

    2014-01-01

    Traditionally, memory, reasoning, and categorization have been treated as separate components of human cognition. We challenge this distinction, arguing that there is broad scope for crossover between the methods and theories developed for each task. The links between memory and reasoning are illustrated in a review of two lines of research. The first takes theoretical ideas (two-process accounts) and methodological tools (signal detection analysis, receiver operating characteristic curves) from memory research and applies them to important issues in reasoning research: relations between induction and deduction, and the belief bias effect. The second line of research introduces a task in which subjects make either memory or reasoning judgments for the same set of stimuli. Other than broader generalization for reasoning than memory, the results were similar for the two tasks, across a variety of experimental stimuli and manipulations. It was possible to simultaneously explain performance on both tasks within a single cognitive architecture, based on exemplar-based comparisons of similarity. The final sections explore evidence for empirical and processing links between inductive reasoning and categorization and between categorization and recognition. An important implication is that progress in all three of these fields will be expedited by further investigation of the many commonalities between these tasks.

  6. Improved Parallel Three-List Algorithm for the Knapsack Problem without Memory Conflicts

    Institute of Scientific and Technical Information of China (English)

    Pan Jun; Li Kenli; Li Qinghua

    2006-01-01

    Based on the two-list algorithm and the parallel three-list algorithm, an improved parallel three-list algorithm for the knapsack problem is proposed, in which the method of divide and conquer and parallel merging without memory conflicts are adopted. To find a solution for the n-element knapsack problem, the proposed algorithm needs O(2^{3n/8}) time when O(2^{3n/8}) shared memory units and O(2^{n/4}) processors are available. The comparisons between the proposed algorithm and 10 existing algorithms show that the improved parallel three-list algorithm is the first exclusive-read exclusive-write (EREW) parallel algorithm that can solve the knapsack instances in less than O(2^{n/2}) time when the available hardware resource is smaller than O(2^{n/2}), and hence is an improved result over past research.
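
    The underlying two-list idea (meet in the middle) is easy to sketch sequentially; the code below is my own illustration of that building block, not the paper's parallel EREW three-list algorithm. It splits the items in half, enumerates the subset sums of each half, and matches the two lists via a sorted search:

      from bisect import bisect_right
      from itertools import combinations

      def subset_sums(items):
          sums = []
          for r in range(len(items) + 1):
              sums.extend(map(sum, combinations(items, r)))
          return sums

      def two_list_knapsack(weights, capacity):
          # Best achievable weight <= capacity (subset-sum form of the
          # knapsack problem), in O(2^(n/2)) space instead of O(2^n).
          half = len(weights) // 2
          left = subset_sums(weights[:half])
          right = sorted(subset_sums(weights[half:]))
          best = 0
          for a in left:
              if a > capacity:
                  continue
              # largest right-half sum that still fits next to a
              j = bisect_right(right, capacity - a) - 1
              best = max(best, a + right[j])
          return best

      print(two_list_knapsack([7, 11, 19, 23, 5, 3, 31, 2], 41))  # 41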

  7. Detection of And—Parallelism in Logic Programs

    Institute of Scientific and Technical Information of China (English)

    黄志毅; 胡守仁

    1990-01-01

    In this paper, we present a technique for detecting and-parallelism in logic programs. The detection consists of three phases: analysis of entry modes, derivation of exit modes, and determination of execution graph expressions. Compared with other techniques [2,4,5], our approach, based on compile-time program-level data-dependence analysis of logic programs, can efficiently exploit and-parallelism in logic programs. Two precompilers, based on our technique and on DeGroot's approach [3] respectively, have been implemented in the SES-PIM system [12]. By compiling and running some typical benchmarks in SES-PIM, we conclude that our technique can, in most cases, exploit as much and-parallelism as the dynamic approach [13] does under the "producer-consumer" scheme, while incurring less dynamic overhead and exploiting more and-parallelism than DeGroot's approach does.

  8. Speedup properties of phases in the execution profile of distributed parallel programs

    Energy Technology Data Exchange (ETDEWEB)

    Carlson, B.M. [Toronto Univ., ON (Canada). Computer Systems Research Institute; Wagner, T.D.; Dowdy, L.W. [Vanderbilt Univ., Nashville, TN (United States). Dept. of Computer Science; Worley, P.H. [Oak Ridge National Lab., TN (United States)

    1992-08-01

    The execution profile of a distributed-memory parallel program specifies the number of busy processors as a function of time. Periods of homogeneous processor utilization are manifested in many execution profiles. These periods can usually be correlated with the algorithms implemented in the underlying parallel code. Three families of methods for smoothing execution profile data are presented. These approaches simplify the problem of detecting end points of periods of homogeneous utilization. These periods, called phases, are then examined in isolation, and their speedup characteristics are explored. A specific workload executed on an Intel iPSC/860 is used for validation of the techniques described.

  9. Support for non-locking parallel reception of packets belonging to a single memory reception FIFO

    Science.gov (United States)

    Chen, Dong [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Salapura, Valentina [Yorktown Heights, NY; Senger, Robert M [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugawara, Yutaka [Yorktown Heights, NY

    2011-01-27

    A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in a process of being stored in the associated memory by another DMA engine unit of the plurality, and prevent the one DMA engine unit from indicating completion of storing the current received packet in the reception memory FIFO (rmFIFO) until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-locking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.
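
    The non-locking discipline described in this record, parallel stores into one reception FIFO with strictly ordered completion, can be mimicked in software with threads standing in for DMA engines (a simplified sketch of the idea, not the patented hardware design):

      import threading

      SLOTS = 64
      fifo = [None] * SLOTS
      done = [False] * SLOTS
      tail = 0                  # next slot to hand out
      head = 0                  # completion pointer visible to consumers
      lock = threading.Lock()   # guards the counters only, not the copies

      def receive(packet):
          global tail, head
          with lock:                      # reserve a slot; cheap and brief
              slot, tail = tail, tail + 1
          fifo[slot % SLOTS] = packet     # "DMA" store proceeds in parallel
          with lock:
              done[slot % SLOTS] = True
              # Advance head only past contiguously completed packets, so
              # completion is never signaled ahead of an unfinished store.
              while head < tail and done[head % SLOTS]:
                  done[head % SLOTS] = False
                  head += 1

      threads = [threading.Thread(target=receive, args=(f"pkt{i}",))
                 for i in range(32)]
      for t in threads: t.start()
      for t in threads: t.join()
      print("packets visible to consumer:", head)   # 32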

  10. Integrated Task And Data Parallel Programming: Language Design

    Science.gov (United States)

    Grimshaw, Andrew S.; West, Emily A.

    1998-01-01

    This research investigates the combination of task and data parallel language constructs within a single programming language. There are a number of applications that exhibit properties which would be well served by such an integrated language. Examples include global climate models, aircraft design problems, and multidisciplinary design optimization problems. Our approach incorporates data parallel language constructs into an existing, object-oriented, task parallel language. The language will support creation and manipulation of parallel classes and objects of both types (task parallel and data parallel). Ultimately, the language will allow data parallel and task parallel classes to be used either as building blocks or managers of parallel objects of either type, thus allowing the development of single- and multi-paradigm parallel applications. 1995 Research Accomplishments: In February I presented a paper at Frontiers '95 describing the design of the data parallel language subset. During the spring I wrote and defended my dissertation proposal. Since that time I have developed a runtime model for the language subset. I have begun implementing the model and hand-coding simple examples which demonstrate the language subset. I have identified an astrophysical fluid flow application which will validate the data parallel language subset. 1996 Research Agenda: Milestones for the coming year include implementing a significant portion of the data parallel language subset over the Legion system. Using simple hand-coded methods, I plan to demonstrate (1) concurrent task and data parallel objects and (2) task parallel objects managing both task and data parallel objects. My next steps will focus on constructing a compiler and implementing the fluid flow application with the language. Concurrently, I will conduct a search for a real-world application exhibiting both task and data parallelism within the same program. Additional 1995 Activities: During the fall I collaborated

  11. Tetanic stimulation of cortical networks induces parallel memory

    NARCIS (Netherlands)

    Veenendaal, van Tamar; Witteveen, Tim; Feber, le Joost; Akay, M

    2013-01-01

    The mechanisms behind memory have been studied mainly in artificial neural networks. Several mechanisms have been proposed, but it remains unclear yet if and how these findings can be translated to biological networks. Here we unravel part of the mechanism by showing that cultured neuronal networks

  12. CLUSTEREASY: A Program for Simulating Scalar Field Evolution on Parallel Computers

    CERN Document Server

    Felder, Gary N

    2007-01-01

    We describe a new, parallel programming version of the scalar field simulation program LATTICEEASY. The new C++ program, CLUSTEREASY, can simulate arbitrary scalar field models on distributed-memory clusters. The speed and memory requirements scale well with the number of processors. As with the serial version of LATTICEEASY, CLUSTEREASY can run simulations in one, two, or three dimensions, with or without expansion of the universe, with customizable parameters and output. The program and its full documentation are available on the LATTICEEASY website at http://www.science.smith.edu/departments/Physics/fstaff/gfelder/latticeeasy/. In this paper we provide a brief overview of what CLUSTEREASY does and the ways in which it does and doesn't differ from the serial version of LATTICEEASY.

  13. A new shared-memory programming paradigm for molecular dynamics simulations on the Intel Paragon

    Energy Technology Data Exchange (ETDEWEB)

    D'Azevedo, E.F.; Romine, C.H.

    1994-12-01

    This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing.

  14. A New Shared-Memory Programming Paradigm for Molecular Dynamics Simulations on the Intel Paragon

    Energy Technology Data Exchange (ETDEWEB)

    D'Azevedo, E.F.

    1995-01-01

    This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing.

  15. Scheduling Constrained-Deadline Sporadic Parallel Tasks Considering Memory Contention

    Science.gov (United States)

    2014-10-01

    (No abstract available: the source record contains only OCR fragments of the paper's reference list.)

  16. Program Transformation to Identify List-Based Parallel Skeletons

    Directory of Open Access Journals (Sweden)

    Venkatesh Kannan

    2016-07-01

    Algorithmic skeletons are used as building blocks to ease the task of parallel programming by abstracting the details of the parallel implementation from the developer. Most existing libraries provide implementations of skeletons that are defined over flat data types such as lists or arrays. However, skeleton-based parallel programming remains very challenging, as it requires intricate analysis of the underlying algorithm and often uses inefficient intermediate data structures. Further, the algorithmic structure of a given program may not match those of list-based skeletons. In this paper, we present a method to automatically transform any given program to one that is defined over a list and is more likely to contain instances of list-based skeletons. This facilitates the parallel execution of a transformed program using existing implementations of list-based parallel skeletons. Further, by using an existing transformation called distillation in conjunction with our method, we produce transformed programs that contain fewer inefficient intermediate data structures.
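
    As a minimal illustration of the skeleton idea (my own sketch, not the paper's transformation), a list-based skeleton such as map fixes the algorithmic structure while hiding the parallel implementation, so a program already expressed over a list can be parallelized by swapping in the skeleton's parallel instance:

      from multiprocessing import Pool

      def par_map(f, xs, procs=4):
          # List-based map skeleton: same meaning as [f(x) for x in xs],
          # but the implementation distributes the work across processes.
          with Pool(procs) as pool:
              return pool.map(f, xs)

      def square(x):
          return x * x

      if __name__ == "__main__":
          # Sequential algorithm, already expressed over a list:
          print([square(x) for x in range(10)])
          # Parallel execution by instantiating the skeleton instead:
          print(par_map(square, range(10)))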

  17. Programming Robots with Associative Memories

    Energy Technology Data Exchange (ETDEWEB)

    Touzet, C

    1999-07-10

    Today, several drawbacks impede the necessary and much-needed use of robot learning techniques in real applications. First, the time needed to achieve the synthesis of any behavior is prohibitive. Second, the robot behavior during the learning phase is by definition bad; it may even be dangerous. Third, except within the lazy learning approach, a new behavior implies a new learning phase. We propose in this paper to use self-organizing maps to encode the non-explicit model of the robot-world interaction sampled by the lazy memory, and then to generate a robot behavior by means of situations to be achieved, i.e., points on the self-organizing maps. Any behavior can be instantaneously synthesized by the definition of a goal situation. Its performance will be minimal (though not evidently bad) and will improve through mere repetition of the behavior.

  18. Development of massively parallel quantum chemistry program SMASH

    Energy Technology Data Exchange (ETDEWEB)

    Ishimura, Kazuya [Department of Theoretical and Computational Molecular Science, Institute for Molecular Science 38 Nishigo-Naka, Myodaiji, Okazaki, Aichi 444-8585 (Japan)

    2015-12-31

    A massively parallel program for quantum chemistry calculations, SMASH, was released under the Apache License 2.0 in September 2014. The SMASH program is written in the Fortran90/95 language with the MPI and OpenMP standards for parallelization. Frequently used routines, such as one- and two-electron integral calculations, are modularized to keep program development simple. The speed-up of the B3LYP energy calculation for (C₁₅₀H₃₀)₂ with the cc-pVDZ basis set (4500 basis functions) was 50,499 on 98,304 cores of the K computer.

  19. Web Based Parallel Programming Workshop for Undergraduate Education.

    Science.gov (United States)

    Marcus, Robert L.; Robertson, Douglass

    Central State University (Ohio), under a contract with Nichols Research Corporation, has developed a World Wide Web based workshop on high performance computing entitled "IBM SP2 Parallel Programming Workshop." The research is part of the DoD (Department of Defense) High Performance Computing Modernization Program. The research…

  20. Protocol-Based Verification of Message-Passing Parallel Programs

    DEFF Research Database (Denmark)

    López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco;

    2015-01-01

    a protocol language based on a dependent type system for message-passing parallel programs, which includes various communication operators, such as point-to-point messages, broadcast, reduce, array scatter and gather. For the verification of a program against a given protocol, the protocol is first...

  1. Certifying Concurrent Programs Using Transactional Memory

    Institute of Scientific and Technical Information of China (English)

    Long Li; Yu Zhang; Yi-Yun Chen; Yong Li

    2009-01-01

    Transactional memory (TM) is a promising new concurrency-control mechanism that can avoid many of the pitfalls of traditional lock-based techniques. TM systems handle data races between threads automatically so that programmers do not have to reason about the interaction of threads manually. TM provides a programming model that may make the development of multi-threaded programs easier. Much work has been done to explore the various implementation strategies of TM systems and to achieve better performance, but little has been done on how to formally reason about programs using TM and how to make sure that such reasoning is sound. In this paper, we focus on the semantics of transactional memory and present a proof-carrying code (PCC) system for reasoning about programs using TM. We formalize our reasoning with respect to the TM semantics, prove its soundness, and use examples to demonstrate its effectiveness.
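
    A toy illustration of the transactional programming model (a deliberately simplified sketch, unrelated to the paper's proof-carrying-code system): a transaction reads a version counter, runs against a snapshot, and commits only if no other thread committed in between, retrying otherwise, so the programmer never reasons about individual locks:

      import threading

      class ToyTM:
          # Coarse-grained software TM: optimistic reads, versioned commit.
          def __init__(self, state):
              self.state = dict(state)
              self.version = 0
              self._commit_lock = threading.Lock()

          def atomically(self, tx):
              while True:                      # retry loop
                  seen = self.version
                  snapshot = dict(self.state)  # read set (full snapshot here)
                  updates = tx(snapshot)       # run user code on the snapshot
                  with self._commit_lock:
                      if self.version == seen: # no conflicting commit
                          self.state.update(updates)
                          self.version += 1
                          return
                  # conflict detected: loop and re-execute the transaction

      tm = ToyTM({"balance": 0})

      def deposit(snapshot):
          return {"balance": snapshot["balance"] + 1}

      threads = [threading.Thread(target=tm.atomically, args=(deposit,))
                 for _ in range(100)]
      for t in threads: t.start()
      for t in threads: t.join()
      print(tm.state["balance"])               # 100: no lost updates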

  2. Accelerate Performance on the Parallel Programming Super Highway

    Science.gov (United States)

    2010-04-01

    Slide text recovered from OCR: Dataflow languages can address some of the major barriers associated with parallel programming and ought to be considered along with traditional (imperative) programming solutions. Processor clock speeds have reached an asymptotic condition (around 3 GHz); Moore's Law may still be valid, but the laws of thermodynamics are also valid. Parallel programming options exist, and many dataflow languages exist today that should be considered alongside them.

  3. On program restructuring, scheduling, and communication for parallel processor systems

    Energy Technology Data Exchange (ETDEWEB)

    Polychronopoulos, Constantine D.

    1986-08-01

    This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.
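
    Loop coalescing, one of the two restructuring techniques named above, collapses a nested parallel loop into a single flat loop so that scheduling decisions operate on one iteration space; a small sketch of the transformation (my illustration, not Parafrase output):

      import numpy as np

      N, M = 4, 6
      A = np.zeros((N, M))

      # Original doubly nested loop: two levels for the scheduler to manage.
      for i in range(N):
          for j in range(M):
              A[i, j] = i * 10 + j

      # Coalesced form: a single loop over N*M iterations; the original
      # indices are recovered arithmetically, which makes it easy to hand
      # out contiguous chunks of k to processors.
      B = np.zeros((N, M))
      for k in range(N * M):
          i, j = divmod(k, M)
          B[i, j] = i * 10 + j

      print(np.array_equal(A, B))   # True: same computation, flat schedule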

  4. Parallel programming of saccades during natural scene viewing: evidence from eye movement positions.

    Science.gov (United States)

    Wu, Esther X W; Gilani, Syed Omer; van Boxtel, Jeroen J A; Amihai, Ido; Chua, Fook Kee; Yen, Shih-Cheng

    2013-10-24

    Previous studies have shown that saccade plans during natural scene viewing can be programmed in parallel. This evidence comes mainly from temporal indicators, i.e., fixation durations and latencies. In the current study, we asked whether eye movement positions recorded during scene viewing also reflect parallel programming of saccades. As participants viewed scenes in preparation for a memory task, their inspection of the scene was suddenly disrupted by a transition to another scene. We examined whether saccades after the transition were invariably directed immediately toward the center or were contingent on saccade onset times relative to the transition. The results, which showed a dissociation in eye movement behavior between two groups of saccades after the scene transition, supported the parallel programming account. Saccades with relatively long onset times (>100 ms) after the transition were directed immediately toward the center of the scene, probably to restart scene exploration. Saccades with short onset times (<100 ms) provided further evidence for the parallel programming of saccades during scene viewing. Additionally, results from the analyses of intersaccadic intervals were also consistent with the parallel programming hypothesis.

  5. Fencing network direct memory access data transfers in a parallel active messaging interface of a parallel computer

    Energy Technology Data Exchange (ETDEWEB)

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-07-14

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.

  6. Fencing network direct memory access data transfers in a parallel active messaging interface of a parallel computer

    Energy Technology Data Exchange (ETDEWEB)

    Blocksome, Michael A.; Mamidala, Amith R.

    2015-07-07

    Fencing direct memory access (`DMA`) data transfers in a parallel active messaging interface (`PAMI`) of a parallel computer, the PAMI including data communications endpoints, each endpoint including specifications of a client, a context, and a task, the endpoints coupled for data communications through the PAMI and through DMA controllers operatively coupled to a deterministic data communications network through which the DMA controllers deliver data communications deterministically, including initiating execution through the PAMI of an ordered sequence of active DMA instructions for DMA data transfers between two endpoints, effecting deterministic DMA data transfers through a DMA controller and the deterministic data communications network; and executing through the PAMI, with no FENCE accounting for DMA data transfers, an active FENCE instruction, the FENCE instruction completing execution only after completion of all DMA instructions initiated prior to execution of the FENCE instruction for DMA data transfers between the two endpoints.
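
    The FENCE semantics described in these two records can be illustrated with MPI's one-sided operations, which offer an analogous completion guarantee. The sketch below is only an analogy: it uses the standard MPI RMA API rather than PAMI's, and the buffer and function names are invented for the example.

    #include <mpi.h>

    /* Issue an ordered sequence of one-sided puts inside a fenced
       epoch: the second MPI_Win_fence returns only after every
       transfer started in the epoch has completed, much like the
       FENCE instruction described above. */
    void fenced_puts(double *local, double *remote_buf, int n, int target)
    {
        MPI_Win win;
        MPI_Win_create(remote_buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                /* open the access epoch    */
        MPI_Put(local, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                /* completes all prior puts */

        MPI_Win_free(&win);
    }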

  7. The parallel programming of voluntary and reflexive saccades.

    Science.gov (United States)

    Walker, Robin; McSorley, Eugene

    2006-06-01

    A novel two-step paradigm was used to investigate the parallel programming of consecutive, stimulus-elicited ('reflexive') and endogenous ('voluntary') saccades. The mean latency of voluntary saccades, made following the first reflexive saccades in two-step conditions, was significantly reduced compared to that of voluntary saccades made in the single-step control trials. The latency of the first reflexive saccades was modulated by the requirement to make a second saccade: first saccade latency increased when a second voluntary saccade was required in the opposite direction to the first saccade, and decreased when a second saccade was required in the same direction as the first reflexive saccade. A second experiment confirmed the basic effect and also showed that a second reflexive saccade may be programmed in parallel with a first voluntary saccade. The results support the view that voluntary and reflexive saccades can be programmed in parallel on a common motor map.

  8. Parallelization of the LBG Vector Quantization Algorithm for Shared Memory Systems

    CERN Document Server

    Annaji, Rajashekar

    2009-01-01

    This paper proposes a parallel approach for the Vector Quantization (VQ) problem in image processing. VQ deals with codebook generation from the input training data set and replacement of any arbitrary data with the nearest codevector. Most of the efforts in VQ have been directed towards designing parallel search algorithms for the codebook, and little has hitherto been done in evolving a parallelized procedure to obtain an optimum codebook. This parallel algorithm addresses the problem of designing an optimum codebook using the traditional LBG type of vector quantization algorithm for shared memory systems and for the efficient usage of parallel processors. Using the codebook formed from a training set, any arbitrary input data is replaced with the nearest codevector from the codebook. The effectiveness of the proposed algorithm is indicated.
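
    The codebook search that dominates LBG iterations is data parallel across training vectors, which is what makes the shared memory approach attractive. Below is a minimal sketch of that assignment step in C with OpenMP; the function and parameter names are invented, and the paper's actual partitioning scheme may differ.

    #include <float.h>

    /* Assign each training vector to its nearest codevector. Each
       vector's search is independent, so the outer loop parallelizes
       directly on a shared memory system. */
    void assign_nearest(const double *data, int nvec, int dim,
                        const double *codebook, int ncode, int *label)
    {
        #pragma omp parallel for
        for (int v = 0; v < nvec; v++) {
            double best = DBL_MAX;
            int bestc = 0;
            for (int c = 0; c < ncode; c++) {
                double d = 0.0;
                for (int k = 0; k < dim; k++) {
                    double diff = data[v * dim + k] - codebook[c * dim + k];
                    d += diff * diff;
                }
                if (d < best) { best = d; bestc = c; }
            }
            label[v] = bestc;   /* nearest codevector for this vector */
        }
    }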

  9. A new parallel-vector finite element analysis software on distributed-memory computers

    Science.gov (United States)

    Qin, Jiangning; Nguyen, Duc T.

    1993-01-01

    A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. A block-skyline storage scheme, along with vector-unrolling techniques, is used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
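
    Carrying out communication concurrently with arithmetic, as MPFEA does, is commonly expressed with nonblocking message passing. The sketch below shows the pattern in C with MPI under assumed helper routines (compute_interior, compute_boundary); it illustrates the overlap idea, not MPFEA's actual iPSC/860 code.

    #include <mpi.h>

    void compute_interior(void);            /* assumed local work    */
    void compute_boundary(const double *);  /* assumed boundary work */

    /* Start the exchange, do the work that needs no remote data, and
       wait only when the received values are actually required. */
    void exchange_and_compute(double *send, double *recv, int n,
                              int left, int right)
    {
        MPI_Request req[2];
        MPI_Irecv(recv, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(send, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

        compute_interior();                 /* overlaps the transfer    */

        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        compute_boundary(recv);             /* needs the exchanged data */
    }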

  10. Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers.

    Science.gov (United States)

    Church, Philip C; Goscinski, Andrzej; Holt, Kathryn; Inouye, Michael; Ghoting, Amol; Makarychev, Konstantin; Reumann, Matthias

    2011-01-01

    The challenge of comparing two or more genomes that have undergone recombination and substantial amounts of segmental loss and gain has recently been addressed for small numbers of genomes. However, datasets of hundreds of genomes are now common and their sizes will only increase in the future. Multiple sequence alignment of hundreds of genomes remains an intractable problem due to quadratic increases in compute time and memory footprint. To date, most alignment algorithms are designed for commodity clusters without parallelism. Hence, we propose the design of a multiple sequence alignment algorithm on massively parallel, distributed memory supercomputers to enable research into comparative genomics on large data sets. Following the methodology of the sequential progressiveMauve algorithm, we design data structures including sequences and sorted k-mer lists on the IBM Blue Gene/P supercomputer (BG/P). Preliminary results show that we can reduce the memory footprint so that we can potentially align over 250 bacterial genomes on a single BG/P compute node. We verify our results on a dataset of E. coli, Shigella and S. pneumoniae genomes. Our implementation returns results matching those of the original algorithm but in 1/2 the time and with 1/4 the memory footprint for scaffold building. In this study, we have laid the basis for multiple sequence alignment of large-scale datasets on a massively parallel, distributed memory supercomputer, thus enabling comparison of hundreds instead of a few genome sequences within reasonable time.

  11. Molecular dynamics simulation on a network of workstations using a machine-independent parallel programming language.

    Science.gov (United States)

    Shifman, M A; Windemuth, A; Schulten, K; Miller, P L

    1992-04-01

    Molecular dynamics simulations investigate local and global motion in molecules. Several parallel computing approaches have been taken to attack the most computationally expensive phase of molecular simulations, the evaluation of long range interactions. This paper reviews these approaches and develops a straightforward but effective algorithm using the machine-independent parallel programming language, Linda. The algorithm was run both on a shared memory parallel computer and on a network of high performance Unix workstations. Performance benchmarks were performed on both systems using two proteins. This algorithm offers a portable cost-effective alternative for molecular dynamics simulations. In view of the increasing numbers of networked workstations, this approach could help make molecular dynamics simulations more easily accessible to the research community.

  12. Basic design of parallel computational program for probabilistic structural analysis

    Energy Technology Data Exchange (ETDEWEB)

    Kaji, Yoshiyuki; Arai, Taketoshi [Japan Atomic Energy Research Inst., Tokai, Ibaraki (Japan). Tokai Research Establishment; Gu, Wenwei; Nakamura, Hitoshi

    1999-06-01

    In our laboratory, as part of `Development of a damage evaluation method for brittle structural materials by microscopic fracture mechanics and probabilistic theory` (nuclear computational science cross-over research), we examine computational methods for a massively parallel computation system coupled with a material strength theory based on the microscopic fracture mechanics of latent cracks and with a continuum structural model, in order to develop new structural reliability evaluation methods for ceramic structures. This technical report reviews probabilistic structural mechanics theory, the basic terms and formulas, and the parallel programming methods that are related to the principal elements in the basic design of the computational mechanics program. (author)

  13. Full Parallel Implementation of an All-Electron Four-Component Dirac-Kohn-Sham Program.

    Science.gov (United States)

    Rampino, Sergio; Belpassi, Leonardo; Tarantelli, Francesco; Storchi, Loriano

    2014-09-09

    A full distributed-memory implementation of the Dirac-Kohn-Sham (DKS) module of the program BERTHA (Belpassi et al., Phys. Chem. Chem. Phys. 2011, 13, 12368-12394) is presented, where the self-consistent field (SCF) procedure is replicated on all the parallel processes, each process working on subsets of the global matrices. The key feature of the implementation is an efficient procedure for switching between two matrix distribution schemes, one (integral-driven) optimal for the parallel computation of the matrix elements and another (block-cyclic) optimal for the parallel linear algebra operations. This approach, making both CPU-time and memory scalable with the number of processors used, virtually overcomes at once both time and memory barriers associated with DKS calculations. Performance, portability, and numerical stability of the code are illustrated on the basis of test calculations on three gold clusters of increasing size, an organometallic compound, and a perovskite model. The calculations are performed on a Beowulf and a BlueGene/Q system.

  14. Parallel Implementation of the PHOENIX Generalized Stellar Atmosphere Program; 2, Wavelength Parallelization

    CERN Document Server

    Baron, E A; Hauschildt, Peter H.

    1997-01-01

    We describe an important addition to the parallel implementation of our generalized NLTE stellar atmosphere and radiative transfer computer program PHOENIX. In a previous paper in this series we described data and task parallel algorithms we have developed for radiative transfer, spectral line opacity, and NLTE opacity and rate calculations. These algorithms divided the work spatially or by spectral lines, that is, distributing the radial zones, individual spectral lines, or characteristic rays among different processors, and employ, in addition, task parallelism for logically independent functions (such as atomic and molecular line opacities). For finite, monotonic velocity fields, the radiative transfer equation is an initial value problem in wavelength, and hence each wavelength point depends upon the previous one. However, for sophisticated NLTE models of both static and moving atmospheres needed to accurately describe, e.g., novae and supernovae, the number of wavelength points is very large (200,000--300,000).

  15. Parallel Libraries to support High-Level Programming

    DEFF Research Database (Denmark)

    Larsen, Morten Nørgaard

    The development of computer architectures during the last ten years has forced programmers to move towards writing parallel programs instead of sequential ones. The homogeneous multi-core architectures from the major CPU producers like Intel and AMD have led this trend, but the introduction ... the general increase in the usage of graphics cards for general-purpose programming (GPGPU) has meant that programmers today must be able to write parallel programs that can utilize not only a small number of computational cores but perhaps hundreds or even thousands. However, most programmers will agree that doing so is not a simple task, and for many non-computer scientists, like chemists and physicists writing programs for simulating their experiments, the task can easily become overwhelming. During the last decades, a lot of research effort has been put into how to create tools that will simplify writing ...

  16. MELD: A Logical Approach to Distributed and Parallel Programming

    Science.gov (United States)

    2012-03-01

  17. Computational cost estimates for parallel shared memory isogeometric multi-frontal solvers

    KAUST Repository

    Woźniak, Maciej

    2014-06-01

    In this paper we present computational cost estimates for parallel shared memory isogeometric multi-frontal solvers. The estimates show that the ideal isogeometric shared memory parallel direct solver scales as O(p^2 log(N/p)) for one dimensional problems, O(Np^2) for two dimensional problems, and O(N^4/3 p^2) for three dimensional problems, where N is the number of degrees of freedom and p is the polynomial order of approximation. The computational costs of the shared memory parallel isogeometric direct solver are compared with those of the sequential isogeometric direct solver, the latter being equal to O(Np^2) for the one dimensional case, O(N^1.5 p^3) for the two dimensional case, and O(N^2 p^3) for the three dimensional case. The shared memory version thus significantly reduces the cost with respect to both N and p. Theoretical estimates are compared with numerical experiments performed with linear, quadratic, cubic, quartic, and quintic B-splines, in one and two spatial dimensions. © 2014 Elsevier Ltd. All rights reserved.

  18. Enhanced computation method of topological smoothing on shared memory parallel machines

    Directory of Open Access Journals (Sweden)

    Mahmoudi Ramzi

    2011-01-01

    To prepare images for better segmentation, we need preprocessing applications, such as smoothing, to reduce noise. In this paper, we present an enhanced computation method for smoothing 2D objects in the binary case. Unlike existing approaches, the proposed method provides parallel computation and better memory management, while preserving the topology (number of connected components) of the original image by using homotopic transformations defined in the framework of digital topology. We introduce an adapted parallelization strategy called split, distribute and merge (SDM), which allows efficient parallelization of a large class of topological operators. To achieve a good speedup and better memory allocation, attention was paid to task scheduling and management. Distributed work during the smoothing process is done by a variable number of threads. Tests on a 2D grayscale image (512*512), using a shared memory parallel machine (SMPM) with 8 CPU cores (2× Xeon E5405 running at a frequency of 2 GHz), showed a speedup of 5.2 with a cache success rate of 70%.
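
    The split, distribute and merge strategy named here can be pictured as band-wise work distribution followed by a merge. The sketch below is a heavily simplified illustration in C with OpenMP: filter_row is an assumed stand-in kernel, and real topological smoothing needs extra care at band borders, which the paper's SDM strategy handles explicitly.

    void filter_row(unsigned char *img, int width, int row);  /* assumed */

    /* split: the image is cut into rows; distribute: OpenMP hands rows
       to a (variable) number of threads; merge: trivial here because
       the rows written are disjoint. */
    void sdm_smooth(unsigned char *img, int width, int height)
    {
        #pragma omp parallel for schedule(static)
        for (int row = 0; row < height; row++)
            filter_row(img, width, row);
    }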

  19. Modelling parallel programs and multiprocessor architectures with AXE

    Science.gov (United States)

    Yan, Jerry C.; Fineman, Charles E.

    1991-01-01

    AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior are described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.

  20. Real-time topological image smoothing on shared memory parallel machines

    Science.gov (United States)

    Mahmoudi, Ramzi; Akil, Mohamed

    2011-03-01

    The smoothing filter is a method of choice for image preprocessing and pattern recognition. We present a new concurrent method for smoothing 2D objects in the binary case. The proposed method provides parallel computation while preserving the topology by using homotopic transformations. We introduce an adapted parallelization strategy called split, distribute and merge (SDM), which allows efficient parallelization of a large class of topological operators including, mainly, smoothing, skeletonization, and watershed algorithms. To achieve a good speedup, attention was paid to task scheduling. Distributed work during the smoothing process is done by a variable number of threads. Tests on a 2D binary image (512*512), using a shared memory parallel machine (SMPM) with 8 CPU cores (2× Xeon E5405 running at a frequency of 2 GHz), showed a speedup of 5.2; a throughput of 32 images per second is thus achieved.

  1. A Large-Grain Parallel Programming Environment for Non-Programmers

    OpenAIRE

    Lewis, Ted

    1994-01-01

    1994 International Conference on Parallel Processing. Banger is a parallel programming environment used by non-professional programmers to write explicitly parallel, large-grain parallel programs. The goals of Banger are: 1. extreme ease of use, 2. immediate feedback, and 3. machine-independence. Banger is based on three principles: 1. separation of parallel programming-in-the-large from sequential programming-in-the-small, 2. separation of the programming environment from the target machine ...

  2. Heterogeneous Multicore Parallel Programming for Graphics Processing Units

    Directory of Open Access Journals (Sweden)

    Francois Bodin

    2009-01-01

    Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current NVIDIA and AMD Graphics Product Group hardware displays a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows the integration of heterogeneous hardware accelerators in an unintrusive manner while preserving the legacy code.
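
    Directive-based workbenches of this kind keep the sequential code intact and let a compiler generate the accelerator version. The sketch below conveys the flavor using OpenACC directives as a stand-in; HMPP's own pragmas (its codelet/callsite directives) have a different syntax, so this is an analogy rather than HMPP code.

    /* The loop remains plain C: ignoring the directive yields the
       original sequential program, while a directive-aware compiler
       offloads the loop to a GPU and manages the data movement. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }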

  3. Efficient Thread Labeling for Monitoring Programs with Nested Parallelism

    Science.gov (United States)

    Ha, Ok-Kyoon; Kim, Sun-Sook; Jun, Yong-Kee

    It is difficult and cumbersome to detect data races that occur in an execution of parallel programs. Any on-the-fly race detection technique using Lamport's happened-before relation needs a thread labeling scheme for generating unique identifiers which maintain logical concurrency information for the parallel threads. NR labeling is an efficient thread labeling scheme for the fork-join program model with nested parallelism, because its efficiency depends only on the nesting depth for every fork and join operation. This paper presents an improved NR labeling, called e-NR labeling, in which every thread generates its label by inheriting the pointer to its ancestor list from the parent thread or by updating the pointer in a constant amount of time and space. This labeling is more efficient than NR labeling, because its efficiency does not depend on the nesting depth for every fork and join operation. Experiments were performed with OpenMP programs having nesting depths of three or four and maximum parallelism varying from 10,000 to 1,000,000. The results show that e-NR is 5 times faster than NR labeling and 4.3 times faster than OS labeling in the average time for creating and maintaining the thread labels. In the average space required for labeling, it is 3.5 times smaller than NR labeling and 3 times smaller than OS labeling.

  4. Feedback Driven Annotation and Refactoring of Parallel Programs

    DEFF Research Database (Denmark)

    Larsen, Per

    This thesis combines programmer knowledge and feedback to improve the modeling and optimization of software. The research is motivated by two observations. First, there is a great need for automatic analysis of software for embedded systems - to expose and model parallelism inherent in programs. Second, some program properties are beyond the reach of such analysis for theoretical and practical reasons - but can be described by programmers. Three aspects are explored. The first is annotation of the source code. Two annotations are introduced. These allow more accurate modeling of parallelism ... are not effective unless programmers are told how and when they are beneficial. A prototype compilation feedback system was developed in collaboration with IBM Haifa Research Labs. It reports to the programmer issues that prevent further analysis. Performance evaluation shows that three programs perform significantly ...

  5. Sisal 3.2: functional language for scientific parallel programming

    Science.gov (United States)

    Kasyanov, Victor

    2013-05-01

    Sisal 3.2 is the new input language of the system of functional programming (SFP), which is under development at the Institute of Informatics Systems in Novosibirsk as an interactive visual environment for supporting scientific parallel programming. This paper contains an overview of Sisal 3.2 and a description of its new features compared with previous versions of the SFP input language, such as multidimensional array support, new abstractions like parametric types and generalised procedures, more flexible user-defined reductions, improved interoperability with other programming languages, and the specification of several optimising source text annotations.

  6. Programming Massively Parallel Architectures using MARTE: a Case Study

    CERN Document Server

    Rodrigues, Wendell; Dekeyser, Jean-Luc

    2011-01-01

    Nowadays, several industrial applications are being ported to parallel architectures. These applications take advantage of the potential parallelism provided by multiple-core processors. Many-core processors, especially GPUs (Graphics Processing Units), have led the race in floating-point performance since 2003. While the performance improvement of general-purpose microprocessors has slowed significantly, GPUs have continued to improve relentlessly. As of 2009, the ratio between many-core GPUs and multicore CPUs for peak floating-point calculation throughput is about 10 times. However, as parallel programming requires a non-trivial distribution of tasks and data, developers find it hard to implement their applications effectively. Aiming to improve the use of many-core processors, this work presents a case study using UML and the MARTE profile to specify and generate OpenCL code for intensive signal processing applications. Benchmark results show the viability of using MDE approaches to generate GPU code.

  7. High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation

    Energy Technology Data Exchange (ETDEWEB)

    Peterka, Tom; Morozov, Dmitriy; Phillips, Carolyn

    2014-11-14

    Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization; but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared-memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.

  8. A Screen Space GPGPU Surface LIC Algorithm for Distributed Memory Data Parallel Sort Last Rendering Infrastructures

    Energy Technology Data Exchange (ETDEWEB)

    Loring, Burlen; Karimabadi, Homa; Roytershteyn, Vadim

    2014-07-01

    The surface line integral convolution (LIC) visualization technique produces dense visualizations of vector fields on arbitrary surfaces. We present a screen space surface LIC algorithm for use in distributed memory data parallel sort last rendering infrastructures. The motivations for our work are to support analysis of datasets that are too large to fit in the main memory of a single computer and to remain compatible with prevalent parallel scientific visualization tools such as ParaView and VisIt. By working in screen space using OpenGL we can leverage the computational power of GPUs when they are available and run without them when they are not. We address efficiency and performance issues that arise from the transformation of data from physical to screen space by selecting an alternate screen space domain decomposition. We analyze the algorithm's scaling behavior with and without GPUs on two high performance computing systems using data from turbulent plasma simulations.

  9. Towards Interactive Visual Exploration of Parallel Programs using a Domain-Specific Language

    KAUST Repository

    Klein, Tobias

    2016-04-19

    The use of GPUs and the massively parallel computing paradigm have become widespread. We describe a framework for the interactive visualization and visual analysis of the run-time behavior of massively parallel programs, especially OpenCL kernels. This facilitates understanding a program's function and structure, finding the causes of possible slowdowns, locating program bugs, and interactively exploring and visually comparing different code variants in order to improve performance and correctness. Our approach enables very specific, user-centered analysis, both in terms of the recording of the run-time behavior and the visualization itself. Instead of having to manually write instrumented code to record data, simple code annotations tell the source-to-source compiler which code instrumentation to generate automatically. The visualization part of our framework then enables the interactive analysis of kernel run-time behavior in a way that can be very specific to a particular problem or optimization goal, such as analyzing the causes of memory bank conflicts or understanding an entire parallel algorithm.

  10. Performance Evaluation Methodologies and Tools for Massively Parallel Programs

    Science.gov (United States)

    Yan, Jerry C.; Sarukkai, Sekhar; Tucker, Deanne (Technical Monitor)

    1994-01-01

    The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessors. However, without effective means to monitor (and analyze) program execution, tuning the performance of parallel programs becomes exponentially difficult as program complexity and machine size increase. The recent introduction of performance tuning tools from various supercomputer vendors (Intel's ParAide, TMC's PRISM, CSI's Apprentice, and Convex's CXtrace) seems to indicate the maturity of performance tool technologies and vendors'/customers' recognition of their importance. However, a few important questions remain: What kind of performance bottlenecks can these tools detect (or correct)? How time consuming is the performance tuning process? What are some important technical issues that remain to be tackled in this area? This workshop reviews the fundamental concepts involved in analyzing and improving the performance of parallel and heterogeneous message-passing programs. Several alternative strategies will be contrasted, and for each we will describe how currently available tuning tools (e.g., AIMS, ParAide, PRISM, Apprentice, CXtrace, ATExpert, Pablo, IPS-2) can be used to facilitate the process. We will characterize the effectiveness of the tools and methodologies based on actual user experiences at NASA Ames Research Center. Finally, we will discuss their limitations and outline recent approaches taken by vendors and the research community to address them.

  11. Scientific programming on massively parallel processor CP-PACS

    Energy Technology Data Exchange (ETDEWEB)

    Boku, Taisuke [Tsukuba Univ., Ibaraki (Japan). Inst. of Information Sciences and Electronics

    1998-03-01

    The massively parallel processor CP-PACS targets various problems in computational physics, and its architecture has been devised to support a wide range of numerical processing. In this report, an outline of the CP-PACS and an example of programming, the Kernel CG benchmark from the NAS Parallel Benchmarks, version 1, are shown, and the pseudo vector processing mechanism and the parallel processing tuning of scientific and technical computation utilizing the three-dimensional hyper crossbar network, the two great features of the CP-PACS architecture, are described. The CP-PACS uses processing units (PUs) based on a RISC processor augmented with a pseudo vector processing facility. Pseudo vector processing is realized as loop processing by scalar instructions. The features of the network connecting the PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part that takes the most processing time in the main loop is the product of matrix and vector (matvec), and the parallel processing of the matvec is explained. The time for the computation by the CPU is determined. As an evaluation of the performance, the evaluation of execution time, the short vector processing of the pseudo vector processor based on a slide window, and a comparison with other parallel computers are reported. (K.I.)
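
    The matvec that dominates the CG kernel parallelizes naturally across matrix rows, with the inner loop left as vectorizable work for each processing unit. A minimal dense sketch in C with OpenMP follows; NPB CG actually operates on a sparse matrix and CP-PACS used its own pseudo vector mechanism, so this only conveys the parallelization pattern.

    /* Rows are distributed across threads; the inner loop is the
       (pseudo) vectorizable part. */
    void matvec(int n, const double *a, const double *x, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += a[i * n + j] * x[j];
            y[i] = sum;
        }
    }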

  12. Computational cost of isogeometric multi-frontal solvers on parallel distributed memory machines

    KAUST Repository

    Woźniak, Maciej

    2015-02-01

    This paper derives theoretical estimates of the computational cost of the isogeometric multi-frontal direct solver executed on parallel distributed memory machines. We show theoretically that for C^(p-1) global continuity of the isogeometric solution, both the computational cost and the communication cost of a direct solver are of order O(log(N)p^2) for the one dimensional (1D) case, O(Np^2) for the two dimensional (2D) case, and O(N^4/3 p^2) for the three dimensional (3D) case, where N is the number of degrees of freedom and p is the polynomial order of the B-spline basis functions. The theoretical estimates are verified by numerical experiments performed with three parallel multi-frontal direct solvers: MUMPS, PaStiX and SuperLU, available through the PETIGA toolkit built on top of PETSc. Numerical results confirm these theoretical estimates both in terms of p and N. For a given problem size, the strong efficiency rapidly decreases as the number of processors increases, becoming about 20% for 256 processors for a 3D example with 128^3 unknowns and linear B-splines with C^0 global continuity, and 15% for a 3D example with 64^3 unknowns and quartic B-splines with C^3 global continuity. At the same time, one cannot arbitrarily increase the problem size, since the memory required by higher order continuity spaces is large, quickly consuming all the available memory resources even in the parallel distributed memory version. Numerical results also suggest that the use of distributed parallel machines is highly beneficial when solving higher order continuity spaces, although the number of processors that one can efficiently employ is somewhat limited.

  13. Programming N-Cubes with a Graphical Parallel Programming Environment Versus an Extended Sequential Language.

    Science.gov (United States)

    1986-11-01

    This report compares Poker, a graphical parallel programming environment and language, with an extended sequential language. Our example programs, implementations of a Cholesky algorithm for a banded matrix, were written in both languages and compiled into object codes that ran on the Cosmic Cube. The program written in Poker is shorter, faster to write, easier to debug, and portable without changes to other parallel computer architectures. The Poker program was slower than the program written directly in Cosmic Cube C; however, the experiments provided insights into changes that make Poker programs nearly as fast.

  14. On the utility of threads for data parallel programming

    Science.gov (United States)

    Fahringer, Thomas; Haines, Matthew; Mehrotra, Piyush

    1995-01-01

    Threads provide a useful programming model for asynchronous behavior because of their ability to encapsulate units of work that can then be scheduled for execution at runtime, based on the dynamic state of a system. Recently, the threaded model has been applied to the domain of data parallel scientific codes, and initial reports indicate that the threaded model can produce performance gains over non-threaded approaches, primarily by overlapping useful computation with communication latency. However, overlapping computation with communication is possible without the benefit of threads if the communication system supports asynchronous primitives, and this comparison has not been made in previous papers. This paper provides a critical look at the utility of lightweight threads as applied to data parallel scientific programming.

  15. Final Report: Center for Programming Models for Scalable Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    Mellor-Crummey, John [William Marsh Rice University

    2011-09-13

    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  16. VPC - A Proposal for a Vector Parallel C Programming Language.

    Science.gov (United States)

    1987-10-30

    [8] B. Kernighan and D. Ritchie. The C Programming Language. Prentice-Hall, 1978. [9] B. Kernighan and R. Pike. The Unix Programming Environment. ... designed to be an extended version of the C language as defined by Kernighan and Ritchie (Ref. 8). Rather than taking the approach of extending ... basis. Unix is a trademark of AT&T Bell Laboratories. ... function calls that activate the FX/8's proprietary ...

  17. LDRD final report on massively-parallel linear programming : the parPCx system.

    Energy Technology Data Exchange (ETDEWEB)

    Parekh, Ojas (Emory University, Atlanta, GA); Phillips, Cynthia Ann; Boman, Erik Gunnar

    2005-02-01

    This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods, including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas, and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Integer and Combinatorial Optimizer).

  18. Relation of Physical Activity to Memory Functioning in Older Adults: The Memory Workout Program.

    Science.gov (United States)

    Rebok, George W.; Plude, Dana J.

    2001-01-01

    The Memory Workout, a CD-ROM program designed to help older adults make changes in physical and cognitive activity influencing memory, was tested with 24 subjects. Results revealed a significant relationship between exercise time, exercise efficacy, and cognitive function, as well as interest in improving memory and physical activity. ...

  19. Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs

    Directory of Open Access Journals (Sweden)

    Fuad Abujarad

    2009-12-01

    Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of the fault-span, the set of states reachable in the presence of faults, and (2) resolving deadlock states, from where the program has no outgoing transitions. Of these, the former closely resembles model checking and, hence, techniques for efficient verification are directly applicable to it. Hence, we focus on expediting the latter with the use of multi-core technology. We present two approaches for parallelization by considering different design choices. The first approach is based on the computation of equivalence classes of program transitions (called group computation) that are needed due to the issue of distribution (i.e., the inability of processes to atomically read and write all program variables). We show that in most cases the speedup of this approach is close to the ideal speedup and in some cases it is superlinear. The second approach uses the traditional technique of partitioning deadlock states among multiple threads. However, our experiments show that the speedup for this approach is small. Consequently, our analysis demonstrates that a simple approach of parallelizing the group computation is likely to be the effective method for using multi-core computing in the context of deadlock resolution.

  20. Transactional Memory

    OpenAIRE

    Grahn, Håkan

    2010-01-01

    Current and future processor generations are based on multicore architectures where the performance increase comes from an increasing number of cores on a chip. In order to utilize the performance potential of multicore architectures, the programs also need to be parallel, but writing parallel programs is a non-trivial task. Transactional memory tries to ease parallel program development by providing atomic and isolated execution of code sequences, enabling software composability and protected ...
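
    The atomic-and-isolated execution described here replaces explicit locking with transaction blocks. A minimal sketch using GCC's experimental transactional memory extension (compiled with -fgnu-tm) is shown below; this is one possible embodiment of the idea, not the only interface transactional memory systems expose.

    /* Shared state updated by many threads. */
    static long balance;

    void deposit(long amount)
    {
        /* The block executes atomically and in isolation: no explicit
           lock is named, so deposit() composes safely with other
           transactional code. Requires gcc -fgnu-tm. */
        __transaction_atomic {
            balance += amount;
        }
    }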

  1. Advanced Programming Platform for efficient use of Data Parallel Hardware

    CERN Document Server

    Cabellos, Luis

    2012-01-01

    Graphics processing units (GPUs) have evolved from specialized hardware capable of rendering high-quality graphics in games into commodity hardware for effectively processing blocks of data in a parallel scheme. This evolution is particularly interesting for scientific groups, which traditionally use mainly the CPU as a workhorse and can now profit from the arrival of GPU hardware in HPC clusters. This new GPU hardware promises a boost in peak performance, but it is not trivial to use. In this article a programming platform designed to promote direct use of this specialized hardware is presented. The platform includes a visual editor of parallel data flows and is oriented to execution on distributed clusters with GPUs. Examples of application to two characteristic problems, Fast Fourier Transform and Image Compression, are also shown.

  2. Automatic Performance Debugging of SPMD-style Parallel Programs

    CERN Document Server

    Liu, Xu; Zhan, Kunlin; Shi, Weisong; Yuan, Lin; Meng, Dan; Wang, Lei

    2011-01-01

    The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any a priori knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on a basis o...

  3. Exploring Shared-Memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems

    KAUST Repository

    Mudigere, Dheevatsa

    2015-05-01

    In this work, we revisit the 1999 Gordon Bell Prize winning PETSc-FUN3D aerodynamics code, extending it with highly-tuned shared-memory parallelization and detailed performance analysis on modern highly parallel architectures. An unstructured-grid implicit flow solver, which forms the backbone of computational aerodynamics, poses particular challenges due to its large irregular working sets, unstructured memory accesses, and variable/limited amount of parallelism. This code, based on a domain decomposition approach, exposes tradeoffs between the number of threads assigned to each MPI-rank subdomain and the total number of domains. By applying several algorithm- and architecture-aware optimization techniques for unstructured grids, we show a 6.9X speed-up in performance on a single-node Intel Xeon E5-2690 v2 processor relative to the out-of-the-box compilation. Our scaling studies on the TACC Stampede supercomputer show that our optimizations continue to provide performance benefits over the baseline implementation as we scale up to 256 nodes.

  4. A learnable parallel processing architecture towards unity of memory and computing.

    Science.gov (United States)

    Li, H; Gao, B; Chen, Z; Zhao, Y; Huang, P; Ye, H; Liu, L; Liu, X; Kang, J

    2015-08-14

    Developing energy-efficient parallel information processing systems beyond von Neumann architecture is a long-standing goal of modern information technologies. The widely used von Neumann computer architecture separates memory and computing units, which leads to energy-hungry data movement when computers work. In order to meet the need for efficient information processing in data-driven applications such as big data and the Internet of Things, an energy-efficient processing architecture beyond von Neumann is critical for the information society. Here we show a non-von Neumann architecture built of resistive switching (RS) devices named "iMemComp", where memory and logic are unified with single-type devices. Leveraging the nonvolatile nature and structural parallelism of crossbar RS arrays, we have equipped "iMemComp" with capabilities of computing in parallel and learning user-defined logic functions for large-scale information processing tasks. Such an architecture eliminates the energy-hungry data movement of von Neumann computers. Compared with contemporary silicon technology, adder circuits based on "iMemComp" can improve speed by 76.8% and power dissipation by 60.3%, together with an aggressive 700-fold reduction in circuit area.

  5. MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

    Science.gov (United States)

    González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

    2016-12-15

    MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively.

  6. Effects of a Memory Training Program in Older People with Severe Memory Loss

    Science.gov (United States)

    Mateos, Pedro M.; Valentin, Alberto; González-Tablas, Maria del Mar; Espadas, Verónica; Vera, Juan L.; Jorge, Inmaculada García

    2016-01-01

    Strategy-based memory training programs are widely used to enhance the cognitive abilities of the elderly. Participants in these training programs are usually people whose mental abilities remain intact. Occasionally, people with cognitive impairment also participate. The aim of this study was to test whether memory training designed specifically for...

  7. PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers

    Energy Technology Data Exchange (ETDEWEB)

    Choi, Jaeyoung; Walker, D.W. [Oak Ridge National Lab., TN (US); Dongarra, J.J. [Oak Ridge National Lab., TN (US)]|[Univ. of Tennessee, Knoxville, TN (US). Dept. of Computer Science

    1993-08-01

    This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A·B, but also transposed multiplication routines C = A^T·B, C = A·B^T, and C = A^T·B^T, for a block scattered data distribution. The routines perform efficiently for a wide range of processor configurations and block sizes. Together, the PUMMA routines provide the same functionality as the Level 3 BLAS routine xGEMM. Details of the parallel implementation of the routines are given, and results are presented for runs on the Intel Touchstone Delta computer.

  8. Mobile and replicated alignment of arrays in data-parallel programs

    Science.gov (United States)

    Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

    1993-01-01

    When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

  9. Energy consumption model over parallel programs implemented on multicore architectures

    Directory of Open Access Journals (Sweden)

    Ricardo Isidro-Ramirez

    2015-06-01

    In High Performance Computing, energy consumption is becoming an important aspect to consider. Due to the high costs that energy production represents in all countries, it holds an important role, and ways to save energy are sought. This is reflected in efforts to reduce the energy requirements of hardware components and applications. Some options have been appearing in order to scale down energy use and, consequently, scale up energy efficiency. One of these strategies is the multithread programming paradigm, whose purpose is to produce parallel programs able to use the full amount of computing resources available in a microprocessor. That energy saving strategy focuses on the efficient use of the multicore processors found in various computing devices, like mobile devices. In fact, as a growing trend, multicore processors have been part of various specific-purpose computers since 2003, from High Performance Computing servers to mobile devices. However, it is not clear how multiprogramming affects energy efficiency. This paper presents an analysis of different types of multicore-based architectures used in computing, and then a valid model is presented. Based on Amdahl's Law, a model that considers different scenarios of energy use in multicore architectures is proposed. Some interesting results were found from experiments with the developed algorithm, which was executed in both parallel and sequential ways. A lower limit of energy consumption was found for one type of multicore architecture, and this behavior was observed experimentally.
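
    The Amdahl's Law bound underlying such models relates the serial fraction of a program to the speedup attainable on n cores; energy models then weight how long the cores stay active. A one-line C helper makes the relation concrete (the function name is invented for the example):

    /* Speedup predicted by Amdahl's Law for serial fraction s on n
       cores: 1 / (s + (1 - s) / n). E.g. s = 0.1, n = 8 gives ~4.7. */
    double amdahl_speedup(double s, int n)
    {
        return 1.0 / (s + (1.0 - s) / (double)n);
    }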

  10. An empirical study of FORTRAN programs for parallelizing compilers

    Science.gov (United States)

    Shen, Zhiyu; Li, Zhiyuan; Yew, Pen-Chung

    1990-01-01

    Some results are reported from an empirical study of program characteristics that are important to parallelizing compiler writers, especially in the area of data dependence analysis and program transformations. The state of the art in data dependence analysis and some parallel execution techniques are examined. The major findings are included. Many subscripts contain symbolic terms with unknown values. A few methods of determining their values at compile time are evaluated. Array references with coupled subscripts appear quite frequently; these subscripts must be handled simultaneously in a dependence test, rather than being handled separately as in current test algorithms. Nonzero coefficients of loop indexes in most subscripts are found to be simple: they are either 1 or -1. This allows an exact real-valued test to be as accurate as an exact integer-valued test for one-dimensional or two-dimensional arrays. Dependencies with uncertain distance are found to be rather common, and one of the main reasons is the frequent appearance of symbolic terms with unknown values.
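
    Coupled subscripts, one of the study's key findings, are subscript pairs that share loop indices and therefore cannot be tested dimension by dimension. The hypothetical C loop below (illustrative only; the study examined Fortran code) shows the pattern, with the simple +1/-1 index coefficients the study reports as typical:

    /* Both subscript positions of a[][] involve i and j together, so a
       dependence test must handle the two dimensions simultaneously;
       testing each subscript separately can miss the coupling.
       Assumes n <= 50 so all indices stay in bounds. */
    void coupled(double a[100][100], int n)
    {
        for (int i = 1; i < n; i++)
            for (int j = 1; j < n; j++)
                a[i + j][i - j + 50] = a[i + j - 1][i - j + 51] + 1.0;
    }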

  11. Poker on the Cosmic Cube: The First Retargetable Parallel Programming Language and Environment.

    Science.gov (United States)

    1986-06-01

    This report describes the retargeting of Poker, a parallel programming environment, to new parallel architectures. The specifics are illustrated by describing the retargeting of Poker to CalTech's Cosmic Cube. Poker requires only three features from the target architecture: MIMD operation, message-passing inter-process communication, and a sequential language (e.g. C) for the processor elements. In return Poker gives the new architecture a complete parallel programming environment which will compile Poker parallel programs, without modification, into efficient object code for the new architecture.

  12. On distributed memory MPI-based parallelization of SPH codes in massive HPC context

    Science.gov (United States)

    Oger, G.; Le Touzé, D.; Guibert, D.; de Leffe, M.; Biddiscombe, J.; Soumagne, J.; Piccinali, J.-G.

    2016-03-01

    Most particle methods share the problem of high computational cost, and in order to satisfy the demands of solvers, currently available hardware technologies must be fully exploited. Two complementary technologies are now accessible. On the one hand, CPUs, which can be structured into a multi-node framework allowing massive data exchanges through a high speed network; in this case, each node usually comprises several cores available to perform multithreaded computations. On the other hand, GPUs, which are derived from graphics computing technologies and are able to perform highly multi-threaded calculations with hundreds of independent threads connected together through a common shared memory. This paper is primarily dedicated to the distributed memory parallelization of particle methods, targeting several thousands of CPU cores. The experience gained clearly shows that parallelizing a particle-based code on moderate numbers of cores can easily lead to acceptable scalability, whilst a scalable speedup on thousands of cores is much more difficult to obtain. The discussion revolves around speeding up particle methods as a whole, in a massive HPC context, by making use of the MPI library. We focus on one particular particle method, Smoothed Particle Hydrodynamics (SPH), one of the most widespread today in the literature as well as in engineering.

  13. An object-oriented bulk synchronous parallel library for multicore programming

    NARCIS (Netherlands)

    Yzelman, A.N.; Bisseling, R.H.

    2012-01-01

    We show that the bulk synchronous parallel (BSP) model, originally designed for distributed-memory systems, is also applicable for shared-memory multicore systems and, furthermore, that BSP libraries are useful in scientific computing on these systems. A proof-of-concept MulticoreBSP library has
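
    In the BSP model the computation proceeds in supersteps: local computation, communication, and a barrier that makes the communicated data safe to read. The sketch below uses classic BSPlib-style primitives in C to show the shape of a superstep; MulticoreBSP's interface is close to, but not identical with, this, and the helper compute_partial is invented for the example.

    #include <bsp.h>

    double neighbour_val;                 /* written into by a remote put */
    double compute_partial(int pid);      /* assumed local computation    */

    void superstep_example(void)
    {
        bsp_begin(bsp_nprocs());
        bsp_push_reg(&neighbour_val, sizeof(double));
        bsp_sync();                       /* registration takes effect    */

        int p = bsp_pid();
        double local = compute_partial(p);          /* compute phase      */
        bsp_put((p + 1) % bsp_nprocs(), &local,     /* communicate to the */
                &neighbour_val, 0, sizeof(double)); /* next process       */
        bsp_sync();                       /* barrier ends the superstep   */
        /* neighbour_val may be read safely in the next superstep */

        bsp_pop_reg(&neighbour_val);
        bsp_end();
    }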

  14. Parallelizing dynamic sequential programs using polyhedral process networks

    NARCIS (Netherlands)

    Nadezhkin, Dmitry

    2012-01-01

    The Polyhedral Process Network (PPN) is a suitable parallel model of computation (MoC) used to specify embedded streaming applications in a parallel form, facilitating efficient mapping onto embedded parallel execution platforms. Unfortunately, specifying an application using a parallel MoC is a…

  15. A Tool for Performance Modeling of Parallel Programs

    Directory of Open Access Journals (Sweden)

    J.A. González

    2003-01-01

    Full Text Available Current performance prediction analytical models try to characterize the performance behavior of actual machines through a small set of parameters. In practice, substantial deviations are observed. These differences are due to factors such as memory hierarchies or network latency. A natural approach is to associate a different proportionality constant with each basic block, and analogously, to associate different latencies and bandwidths with each "communication block". Unfortunately, using this approach implies that the evaluation of parameters must be done for each algorithm. This is a heavy task, involving experiment design, timing, statistics, pattern recognition, and multi-parameter fitting algorithms, so software support is required. We present a compiler that takes as source a C program annotated with complexity formulas and produces as output an instrumented code. The trace files obtained from the execution of the resulting code are analyzed with an interactive interpreter, giving us, among other information, the values of those parameters.

  16. Parallel database search and prime factorization with magnonic holographic memory devices

    Energy Technology Data Exchange (ETDEWEB)

    Khitun, Alexander [Electrical and Computer Engineering Department, University of California - Riverside, Riverside, California 92521 (United States)

    2015-12-28

    In this work, we describe the capabilities of Magnonic Holographic Memory (MHM) for parallel database search and prime factorization. MHM is a type of holographic device which utilizes spin waves for data transfer and processing. Its operation is based on the correlation between the phases and the amplitudes of the input spin waves and the output inductive voltage. The input of MHM is provided by a phased array of spin-wave generating elements, allowing the production of phase patterns of arbitrary form. The latter makes it possible to code logic states into the phases of propagating waves and exploit wave superposition for parallel data processing. We present the results of numerical modeling illustrating parallel database search and prime factorization. The results of numerical simulations on database search are in agreement with the available experimental data. The use of classical wave interference may result in a significant speedup over conventional digital logic circuits in special-task data processing (e.g., a factor of √n in database search). Potentially, magnonic holographic devices can be implemented as complementary logic units to digital processors. Physical limitations and technological constraints of the spin-wave approach are also discussed.

  17. SONOS Nonvolatile Memory Cell Programming Characteristics

    Science.gov (United States)

    MacLeod, Todd C.; Phillips, Thomas A.; Ho, Fat D.

    2010-01-01

    Silicon-oxide-nitride-oxide-silicon (SONOS) nonvolatile memory is gaining favor over conventional EEPROM FLASH memory technology. This paper characterizes the SONOS write operation using a nonquasi-static MOSFET model. This includes floating gate charge and voltage characteristics as well as tunneling current, voltage threshold and drain current characterization. The characterization of the SONOS memory cell predicted by the model closely agrees with experimental data obtained from actual SONOS memory cells. The tunnel current, drain current, threshold voltage and read drain current all closely agreed with empirical data.

  18. An approach to multicore parallelism using functional programming: A case study based on Presburger Arithmetic

    DEFF Research Database (Denmark)

    Dung, Phan Anh; Hansen, Michael Reichhardt

    2015-01-01

    platform executing on an 8-core machine. A speedup of approximately 4 was obtained for Cooper’s algorithm and a speedup of approximately 6 was obtained for the exact-shadow part of the Omega Test. The considered procedures are complex, memory-intense algorithms on huge formula trees, and the case study reveals more generally applicable techniques and guidelines for deriving parallel algorithms from sequential ones in the context of data-intensive tree algorithms. The obtained insights should apply to any strict and impure functional programming language. Furthermore, the results obtained for the exact-shadow elimination procedure have wider applicability because they can be transferred directly to the Fourier–Motzkin elimination method.

  19. Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

    CERN Document Server

    Kristensen, Mads Ruben Burgdorff

    2012-01-01

    This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer-science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks, allowing maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed-memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as 27 times, from a maximum of 54% to only 2% of total execution time.
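
    The latency-hiding the runtime automates can also be written by hand in a compiled language; the sketch below (hedged C/MPI illustration, not DistNumPy code) shows the pattern: initiate communication early, compute on independent data, and wait as late as possible.

      /* overlap.c - hand-coded analogue of automatic latency-hiding. */
      #include <mpi.h>
      #include <stdio.h>
      #define N 1024

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          int peer = rank ^ 1;                  /* pair ranks 0-1, 2-3, ... */

          double out[N], in[N], local[N];
          for (int i = 0; i < N; i++) { out[i] = rank; local[i] = i; }

          MPI_Request reqs[2];
          if (peer < size) {                    /* 1. start communication early */
              MPI_Irecv(in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
              MPI_Isend(out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
          }

          double acc = 0.0;                     /* 2. independent work overlaps */
          for (int i = 0; i < N; i++) acc += local[i] * local[i];

          if (peer < size)                      /* 3. wait only when data is needed */
              MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

          printf("rank %d: acc=%g in[0]=%g\n", rank, acc,
                 peer < size ? in[0] : -1.0);
          MPI_Finalize();
          return 0;
      }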

  20. Parallel conjugate gradient: effects of ordering strategies, programming paradigms, and architectural platforms

    Energy Technology Data Exchange (ETDEWEB)

    Oliker, L.; Li, X.; Heber, G.; Biswas, R.

    2000-05-01

    The Conjugate Gradient (CG) algorithm is perhaps the best-known iterative technique for solving sparse linear systems that are symmetric and positive definite. A sparse matrix-vector multiply (SPMV) usually accounts for most of the floating-point operations within a CG iteration. In this paper, we investigate the effects of various ordering and partitioning strategies on the performance of parallel CG and SPMV using different programming paradigms and architectures. Results show that for this class of applications, ordering significantly improves overall performance, that cache reuse may be more important than reducing communication, and that it is possible to achieve message-passing performance using shared-memory constructs through careful data ordering and distribution. However, a multithreaded implementation of CG on the Tera MTA does not require special ordering or partitioning to obtain high efficiency and scalability.
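
    For reference, a plain C sketch of the compressed-sparse-row (CSR) SPMV kernel at the heart of each CG iteration; the indirect access x[col[k]] is exactly what makes the ordering and partitioning strategies above matter for cache reuse and communication.

      /* spmv_csr.c - y = A*x with A in CSR form. */
      #include <stdio.h>

      void spmv_csr(int n, const int *rowptr, const int *col,
                    const double *val, const double *x, double *y) {
          for (int i = 0; i < n; i++) {
              double sum = 0.0;
              for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                  sum += val[k] * x[col[k]];   /* indirect access: x[col[k]] */
              y[i] = sum;
          }
      }

      int main(void) {
          /* 3x3 example: [2 1 0; 0 3 0; 1 0 4] */
          int rowptr[] = {0, 2, 3, 5};
          int col[]    = {0, 1, 1, 0, 2};
          double val[] = {2, 1, 3, 1, 4};
          double x[]   = {1, 1, 1}, y[3];
          spmv_csr(3, rowptr, col, val, x, y);
          printf("%g %g %g\n", y[0], y[1], y[2]);  /* 3 3 5 */
          return 0;
      }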

  1. Efficient Memory Race Recording Scheme for Deterministic Replay of Parallel Programs on Multi-Core Architectures

    Institute of Scientific and Technical Information of China (English)

    刘磊; 黄河; 唐志敏

    2012-01-01

    Current shared-memory multi-core and multiprocessor systems are nondeterministic: when they execute a multithreaded application, even with the same input, they can produce a different output each time. This frustrates debugging, limits the ability to test multithreaded code properly, and is becoming a major stumbling block to the much-needed widespread adoption of parallel programming. Support for deterministic replay of multithreaded execution is greatly helpful in finding concurrency bugs; accurately recording the order of conflicting memory accesses during the initial execution is the basis of such replay. A memory race recording scheme, named Rainbow, is proposed. Its core idea is to make inter-thread communication fully deterministic. The unique feature of Rainbow is that it precisely establishes happens-before relationships between conflicting memory operations in different threads. Using an efficient Bloom-filter-based coherence history queue for address-conflict detection, Rainbow suppresses the generation of redundant happens-before relations already implied by the log, yielding a compact log. Rainbow adds modest hardware to the base multi-core processor, and the coherence protocol is unmodified. Analysis shows that Rainbow reduces the number of log records by 17% relative to a state-of-the-art scheme, and the recorded execution speed is similar to that of release-consistency (RC) execution. Furthermore, describing the happens-before relations with logical vector clocks, rather than scalar clocks, avoids misidentifying happens-before relations and reduces the loss of parallelism during replay.
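
    A software analogue of the vector-clock happens-before test mentioned above (Rainbow itself is a hardware scheme); it shows why vector clocks avoid the spurious orderings a scalar clock would imply.

      /* vclock.c - event e1 happens-before e2 iff every component of e1's
         clock is <= the corresponding component of e2's clock, and at
         least one is strictly smaller. */
      #include <stdio.h>
      #include <stdbool.h>
      #define NTHREADS 3

      typedef struct { int c[NTHREADS]; } VClock;

      bool happens_before(const VClock *a, const VClock *b) {
          bool strictly = false;
          for (int i = 0; i < NTHREADS; i++) {
              if (a->c[i] > b->c[i]) return false;
              if (a->c[i] < b->c[i]) strictly = true;
          }
          return strictly;
      }

      int main(void) {
          VClock e1 = {{1, 0, 0}}, e2 = {{1, 2, 0}}, e3 = {{0, 0, 1}};
          printf("e1 -> e2: %d\n", happens_before(&e1, &e2)); /* 1: ordered    */
          printf("e1 -> e3: %d\n", happens_before(&e1, &e3)); /* 0: concurrent */
          /* A scalar clock would totally order e1 and e3 and so misreport a
             happens-before relation, losing parallelism during replay. */
          return 0;
      }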

  2. The Rochester Checkers Player: Multi-Model Parallel Programming for Animate Vision

    Science.gov (United States)

    1991-06-01

    parallel programming is likely to serve for all tasks, however. Early vision algorithms are intensely data parallel, often utilizing fine-grain parallel computations that share an image, while cognition algorithms decompose naturally by function, often consisting of loosely coupled, coarse-grain parallel units. A typical animate vision application will likely consist of many tasks, each of which may require a different parallel programming model, and all of which must cooperate to achieve the desired behavior. These multi-model programs require an…

  3. Safe self-scheduling: A parallel loop scheduling scheme for shared-memory multiprocessors

    Energy Technology Data Exchange (ETDEWEB)

    Liu, J. [Western Oregon State College, Monmouth, OR (United States); Saletore, V.A. [Oregon State Univ., Corvallis, OR (United States); Lewis, T.G. [Naval Postgraduate School, Monterey, CA (United States)

    1994-12-01

    In this paper we present Safe Self-Scheduling (SSS), a new scheduling scheme that schedules parallel loops with variable-length iteration execution times not known at compile time. The scheme assumes a shared memory space. SSS combines static scheduling with dynamic scheduling and draws favorable advantages from each. First, it reduces the dynamic scheduling overhead by statically scheduling a major portion of the loop iterations. Second, the workload is balanced with a simple and efficient self-scheduling scheme by applying a new measure, the smallest critical chore size. Experimental results comparing SSS with other scheduling schemes indicate that SSS surpasses them. In the experiment on Gauss-Jordan, an application that is suitable for static scheduling schemes, SSS is the only self-scheduling scheme that outperforms the static scheduling scheme. This indicates that SSS achieves a balanced workload with a very small amount of overhead.
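
    The sketch below shows the basic self-scheduling mechanism such schemes build on, reduced to a fixed chunk size; SSS itself additionally chooses chunk sizes using its critical-chore-size measure, which is not modeled here.

      /* self_sched.c - threads grab the next chunk of iterations with an
         atomic fetch-and-add. Build (typical): cc -pthread self_sched.c */
      #include <stdio.h>
      #include <stdatomic.h>
      #include <pthread.h>

      #define N 1000
      #define CHUNK 16
      #define NTHREADS 4

      atomic_int next_iter = 0;
      double result[N];

      void *worker(void *arg) {
          (void)arg;
          for (;;) {
              int start = atomic_fetch_add(&next_iter, CHUNK);
              if (start >= N) break;
              int end = start + CHUNK < N ? start + CHUNK : N;
              for (int i = start; i < end; i++)
                  result[i] = i * 0.5;   /* variable-cost loop body in reality */
          }
          return NULL;
      }

      int main(void) {
          pthread_t t[NTHREADS];
          for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
          for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
          printf("result[999] = %g\n", result[999]);
          return 0;
      }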

  4. Parallel main-memory indexing for moving-object query and update workloads

    DEFF Research Database (Denmark)

    Sidlauskas, Darius; Saltenis, Simonas; Jensen, Christian Søndergaard

    2012-01-01

    This paper presents a main-memory indexing technique that aims to support the location-related query and update workloads generated by very large populations of moving objects. The technique, called PGrid, uses a grid structure that is capable of exploiting the parallelism of modern multi-core processors. Since both updates and queries are processed concurrently, it avoids the stop-the-world problem that occurs when workload processing is interrupted to perform snapshotting; its concurrency control mechanism relies instead on hardware-assisted atomic updates as well as object-level copying, and it treats updates as non-divisible operations rather than as combinations of deletions and insertions. Thus, the query semantics guarantee that no objects are missed in query results. Empirical studies demonstrate that PGrid scales near-linearly with the number of hardware threads on four modern multi-core processors.

  5. Expert Programmer versus Parallelizing Compiler: A Comparative Study of Two Approaches for Distributed Shared Memory

    Directory of Open Access Journals (Sweden)

    M. F. P. O'Boyle

    1996-01-01

    Full Text Available This article critically examines current parallel programming practice and optimizing compiler development. The general strategies employed by compiler and programmer to optimize a Fortran program are described, and then illustrated for a specific case by applying them to a well-known scientific program, TRED2, using the KSR-1 as the target architecture. Extensive measurement is applied to the resulting versions of the program, which are compared with a version produced by a commercial optimizing compiler, KAP. The compiler strategy significantly outperforms KAP and does not fall far short of the performance achieved by the programmer. Following the experimental section each approach is critiqued by the other. Perceived flaws, advantages, and common ground are outlined, with an eye to improving both schemes.

  6. Memory for radio advertisements: the effect of program and typicality.

    Science.gov (United States)

    Martín-Luengo, Beatriz; Luna, Karlos; Migueles, Malen

    2013-01-01

    We examined the influence of the type of radio program on the memory for radio advertisements. We also investigated the role in memory of the typicality (high or low) of the elements of the products advertised. Participants listened to three types of programs (interesting, boring, enjoyable) with two advertisements embedded in each. After completing a filler task, the participants performed a true/false recognition test. Hits and false alarm rates were higher for the interesting and enjoyable programs than for the boring one. There were also more hits and false alarms for the high-typicality elements. The response criterion for the advertisements embedded in the boring program was stricter than for the advertisements in other types of programs. We conclude that the type of program in which an advertisement is inserted and the nature of the elements of the advertisement affect both the number of hits and false alarms and the response criterion, but not the accuracy of the memory.

  7. Parallel Programming Methodologies for Non-Uniform Structured Problems in Materials Science

    Science.gov (United States)

    1993-10-01

    Interim report for the period 12/01/92 - 09/30/93 on parallel programming methodologies for non-uniform structured problems in materials science (annual report transmitted to Dr. van Tilborg).

  8. The Medial Temporal Lobe – Conduit of Parallel Connectivity: A model for Attention, Memory, and Perception.

    Directory of Open Access Journals (Sweden)

    Brian B. Mozaffari

    2014-11-01

    Full Text Available Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL), located deep in the hierarchy, serves as a bridge connecting supra- to infra-MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick, complex visual phenomena. Bypassing intermediate hierarchy levels, information conveyed through the MTL ‘bridge’ allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these ‘bridge’ predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after an MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid the processing of transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation.

  9. The medial temporal lobe-conduit of parallel connectivity: a model for attention, memory, and perception.

    Science.gov (United States)

    Mozaffari, Brian

    2014-01-01

    Based on the notion that the brain is equipped with a hierarchical organization, which embodies environmental contingencies across many time scales, this paper suggests that the medial temporal lobe (MTL), located deep in the hierarchy, serves as a bridge connecting supra- to infra-MTL levels. Bridging the upper and lower regions of the hierarchy provides a parallel architecture that optimizes information flow between upper and lower regions to aid attention, encoding, and processing of quick, complex visual phenomena. Bypassing intermediate hierarchy levels, information conveyed through the MTL "bridge" allows upper levels to make educated predictions about the prevailing context and accordingly select lower representations to increase the efficiency of predictive coding throughout the hierarchy. This selection or activation/deactivation is associated with endogenous attention. In the event that these "bridge" predictions are inaccurate, this architecture enables the rapid encoding of novel contingencies. A review of hierarchical models in relation to memory is provided along with a new theory, Medial-temporal-lobe Conduit for Parallel Connectivity (MCPC). In this scheme, consolidation is considered as a secondary process, occurring after an MTL-bridged connection, which eventually allows upper and lower levels to access each other directly. With repeated reactivations, as contingencies become consolidated, less MTL activity is predicted. Finally, MTL bridging may aid the processing of transient but structured perceptual events, by allowing communication between upper and lower levels without calling on intermediate levels of representation.

  10. Current Development of Parallel Programming Languages

    Institute of Scientific and Technical Information of China (English)

    韩卫; 郝红宇; 代丽

    2003-01-01

    In this paper we review the history of parallel programming languages and list a number of current ones. Then, classifying them by principle, we analyze some representative parallel programming languages in detail. Finally, we offer an outlook on the further development of parallel programming languages.

  11. Kemari: A Portable High Performance Fortran System for Distributed Memory Parallel Processors

    Directory of Open Access Journals (Sweden)

    T. Kamachi

    1997-01-01

    Full Text Available We have developed a compilation system which extends High Performance Fortran (HPF) in various aspects. We support the parallelization of well-structured problems with loop distribution and alignment directives similar to HPF's data distribution directives. Such directives give both additional control to the user and simplify the compilation process. For the support of unstructured problems, we provide directives for dynamic data distribution through user-defined mappings. The compiler also allows integration of message-passing interface (MPI) primitives. The system is part of a complete programming environment which also comprises a parallel debugger and a performance monitor and analyzer. After an overview of the compiler, we describe the language extensions and related compilation mechanisms in detail. Performance measurements demonstrate the compiler's applicability to a variety of application classes.

  12. Processor Allocation for Optimistic Parallelization of Irregular Programs

    CERN Document Server

    Versaci, Francesco

    2012-01-01

    Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this problem.

  13. Exploiting variability for energy optimization of parallel programs

    Energy Technology Data Exchange (ETDEWEB)

    Lavrijsen, Wim [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Iancu, Costin [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); de Jong, Wibe [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Chen, Xin [Georgia Inst. of Technology, Atlanta, GA (United States); Schwan, Karsten [Georgia Inst. of Technology, Atlanta, GA (United States)

    2016-04-18

    In this paper we present optimizations that use DVFS mechanisms to reduce the total energy usage of scientific applications. Our main insight is that noise is intrinsic to large-scale parallel executions and appears whenever shared resources are contended. The presence of noise allows us to identify and manipulate any program regions amenable to DVFS. When compared with previous energy optimizations that make per-core decisions using predictions of running time, our scheme uses a qualitative approach to recognize the signature of executions amenable to DVFS. By recognizing the "shape of variability" we can optimize codes with highly dynamic behavior, which pose challenges to all existing DVFS techniques. We validate our approach using offline and online analyses for one-sided and two-sided communication paradigms. We have applied our methods to NWChem, and we show best-case improvements in energy use of 12% at no loss in performance when using online optimizations running on 720 Haswell cores with one-sided communication. With NWChem on two-sided MPI and offline analysis capturing the initialization, we find energy savings of up to 20%, with less than 1% performance cost.

  14. Execution time support for scientific programs on distributed memory machines

    Science.gov (United States)

    Berryman, Harry; Saltz, Joel; Scroggs, Jeffrey

    1990-01-01

    Optimizations are considered that are required for efficient execution of code segments that consist of loops over distributed data structures. The PARTI (Parallel Automated Runtime Toolkit at ICASE) execution-time primitives are designed to carry out these optimizations and can be used to implement a wide range of scientific algorithms on distributed-memory machines. These primitives allow the user to control array mappings in a way that gives the appearance of shared memory, and computations can be based on a global index set. Primitives are used to carry out gather and scatter operations on distributed arrays. Communication patterns are derived at runtime, and the appropriate send and receive messages are automatically generated.
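
    A sketch of the gather primitive's effect, reduced to a single address space for clarity; on a real distributed-memory machine, the runtime-derived index list instead drives the generation of send and receive messages (the inspector/executor pattern).

      /* gather.c - an "inspector" pass derives which global indices a loop
         touches; an "executor" pass gathers them into a local buffer before
         the computation runs. */
      #include <stdio.h>

      #define NLOCAL 4

      int main(void) {
          double global[10] = {0,1,2,3,4,5,6,7,8,9};  /* stands in for a distributed array */
          int indices[NLOCAL] = {7, 2, 9, 0};          /* derived at runtime by the inspector */
          double local[NLOCAL];

          for (int i = 0; i < NLOCAL; i++)             /* executor: gather */
              local[i] = global[indices[i]];

          double sum = 0.0;                            /* computation on gathered data */
          for (int i = 0; i < NLOCAL; i++) sum += local[i];
          printf("sum = %g\n", sum);                   /* 18 */
          return 0;
      }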

  15. Memory intervention: the value of a clinical holistic program for older adults with memory impairments.

    Science.gov (United States)

    Hyer, Lee; Scott, Ciera; Lyles, Jessica; Dhabliwala, Jason; McKenzie, Laura

    2014-03-01

    Increasingly, cognitive training appears an asset in improving attention and working memory for older adults. We conducted a study involving a 'holistic' training program for several cohorts of older adults (N = 112), targeting community residents with a spectrum of memory complaints ranging from Age Associated Memory Impairment to mild dementia. We developed a 7-session, manualized program targeting concentration, as well as mindfulness, exercise, stress reduction, socialization, diet, and values/identity techniques. We applied this model to 11 cohorts and conducted pre- and post-testing on memory (List Learning, Story Memory, Coding, Digit Span, Recall, and Recognition) and function (Functional Assessment Questionnaire). We also divided the Memory Group by Risk Status - Low, Medium, and High. Results showed that the Memory Clinic Group as a whole improved on this training on most scales. When broken down by risk status, the Low and Medium Risk Groups were statistically superior to the High Risk Group on cognitive measures. There were differences also on adjustment, this time favoring only the Low Risk Groups. Holistic memory training seems to be impactful for older adults.

  16. Region-based memory management for Mercury programs

    CERN Document Server

    Phan, Quan; Somogyi, Zoltan

    2012-01-01

    Region-based memory management (RBMM) is a form of compile time memory management, well-known from the functional programming world. In this paper we describe our work on implementing RBMM for the logic programming language Mercury. One interesting point about Mercury is that it is designed with strong type, mode, and determinism systems. These systems not only provide Mercury programmers with several direct software engineering benefits, such as self-documenting code and clear program logic, but also give language implementors a large amount of information that is useful for program analyses. In this work, we make use of this information to develop program analyses that determine the distribution of data into regions and transform Mercury programs by inserting into them the necessary region operations. We prove the correctness of our program analyses and transformation. To execute the annotated programs, we have implemented runtime support that tackles the two main challenges posed by backtracking. First, ba...

  17. Checkpointing Shared Memory Programs at the Application-level

    Energy Technology Data Exchange (ETDEWEB)

    Bronevetsky, G; Schulz, M; Szwed, P; Marques, D; Pingali, K

    2004-09-08

    Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.

  18. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems.

    Science.gov (United States)

    Stone, John E; Gohara, David; Shi, Guochun

    2010-05-01

    We provide an overview of the key architectural features of recent microprocessor designs and describe the programming model and abstractions provided by OpenCL, a new parallel programming standard targeting these architectures.
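
    As a flavor of the programming model, here is a minimal OpenCL C kernel (host-side platform, context, queue, and buffer setup omitted): each work-item computes one element, and the guard handles global sizes padded beyond n.

      /* vadd.cl - elementwise vector addition as an OpenCL kernel. */
      __kernel void vadd(__global const float *a,
                         __global const float *b,
                         __global float *c,
                         const unsigned int n)
      {
          size_t i = get_global_id(0);   /* this work-item's global index */
          if (i < n)                     /* guard against padded global size */
              c[i] = a[i] + b[i];
      }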

  19. Research on Parallel Application Program Scheduling Strategies

    Institute of Scientific and Technical Information of China (English)

    李爱玲; 王璐; 彭云峰

    2012-01-01

    To improve the execution efficiency of parallel application programs on heterogeneous platforms, parallel components are classified from the points of view of paradigm and granularity, and a corresponding component model is designed. The model supports serial paradigms, message-passing parallel paradigms, and shared-memory parallel paradigms, at coarse, medium, or fine granularity, and components can be programmed in the programming language appropriate to each paradigm. Based on the description of a component's paradigm, granularity, and resource usage, a component scheduling policy is further proposed. Tests show that the component model and scheduling strategy improve the execution of parallel application programs and raise the resource utilization of heterogeneous platforms.

  20. A Verified Integration of Imperative Parallel Programming Paradigms in an Object-Oriented Language

    OpenAIRE

    Sivilotti, Paul

    1993-01-01

    CC++ is a parallel object-oriented programming language that uses parallel composition, atomic functions, and single-assignment variables to express concurrency. We show that this programming paradigm is equivalent to several traditional imperative communication and synchronization models, namely semaphores, monitors, and asynchronous channels. A collection of libraries which integrates these traditional models with CC++ is specified, implemented, and formally verified.

  1. Efficient implementation of a multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    Science.gov (United States)

    Bhanot, Gyan V.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos M.

    2008-01-01

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  2. Efficient implementation of multidimensional fast fourier transform on a distributed-memory parallel multi-node computer

    Science.gov (United States)

    Bhanot, Gyan V [Princeton, NJ; Chen, Dong [Croton-On-Hudson, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

    2012-01-10

    The present invention is directed to a method, system and program storage device for efficiently implementing a multidimensional Fast Fourier Transform (FFT) of a multidimensional array comprising a plurality of elements initially distributed in a multi-node computer system comprising a plurality of nodes in communication over a network, comprising: distributing the plurality of elements of the array in a first dimension across the plurality of nodes of the computer system over the network to facilitate a first one-dimensional FFT; performing the first one-dimensional FFT on the elements of the array distributed at each node in the first dimension; re-distributing the one-dimensional FFT-transformed elements at each node in a second dimension via "all-to-all" distribution in random order across other nodes of the computer system over the network; and performing a second one-dimensional FFT on elements of the array re-distributed at each node in the second dimension, wherein the random order facilitates efficient utilization of the network thereby efficiently implementing the multidimensional FFT. The "all-to-all" re-distribution of array elements is further efficiently implemented in applications other than the multidimensional FFT on the distributed-memory parallel supercomputer.

  3. Injecting Artificial Memory Errors Into a Running Computer Program

    Science.gov (United States)

    Bornstein, Benjamin J.; Granat, Robert A.; Wagstaff, Kiri L.

    2008-01-01

    Single-event upsets (SEUs) or bitflips are computer memory errors caused by radiation. BITFLIPS (Basic Instrumentation Tool for Fault Localized Injection of Probabilistic SEUs) is a computer program that deliberately injects SEUs into another computer program, while the latter is running, for the purpose of evaluating the fault tolerance of that program. BITFLIPS was written as a plug-in extension of the open-source Valgrind debugging and profiling software. BITFLIPS can inject SEUs into any program that can be run on the Linux operating system, without needing to modify the program's source code. Further, if access to the original program source code is available, BITFLIPS offers fine-grained control over exactly when and which areas of memory (as specified via program variables) will be subjected to SEUs. The rate of injection of SEUs is controlled by specifying either a fault probability or a fault rate based on memory size and radiation exposure time, in units of SEUs per byte per second. BITFLIPS can also log each SEU that it injects and, if program source code is available, report the magnitude of effect of the SEU on a floating-point value or other program variable.
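
    The core idea can be sketched in a few lines of C; this illustrates SEU injection in general, not BITFLIPS itself, which operates on a separate running process via Valgrind.

      /* bitflip.c - flip one randomly chosen bit in a target memory region
         to emulate a radiation-induced single-event upset. */
      #include <stdio.h>
      #include <stdlib.h>

      void inject_seu(void *region, size_t nbytes) {
          size_t byte = (size_t)rand() % nbytes;   /* pick a byte ...  */
          int bit = rand() % 8;                    /* ... and a bit    */
          ((unsigned char *)region)[byte] ^= (unsigned char)(1u << bit);
      }

      int main(void) {
          srand(42);
          double x[4] = {1.0, 2.0, 3.0, 4.0};
          inject_seu(x, sizeof x);                 /* corrupt one bit in x */
          for (int i = 0; i < 4; i++) printf("x[%d] = %.17g\n", i, x[i]);
          return 0;
      }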

  4. Retargeting of existing FORTRAN program and development of parallel compilers

    Science.gov (United States)

    Agrawal, Dharma P.

    1988-01-01

    The software models used in implementing the parallelizing compiler for the B-HIVE multiprocessor system are described. The various models and strategies used in the compiler development are: the flexible granularity model, which allows a compromise between two extreme granularity models; the communication model, which is capable of precisely describing interprocessor communication timings and patterns; the loop type detection strategy, which identifies different types of loops; the critical path with coloring scheme, which is a versatile scheduling strategy for any multicomputer with associated communication costs; and the loop allocation strategy, which realizes optimal overlap between computation and communication in the system. Using these models, several sample routines of the AIR3D package are examined and tested. The automatically generated codes are highly parallelized, providing a maximal degree of parallelism and obtaining speedups on up to 28- to 32-processor systems. A comparison of parallel codes for both the existing and the proposed communication model is performed, and the corresponding expected speedup factors are obtained. The experimentation shows that the B-HIVE compiler produces more efficient code than existing techniques. Work is progressing well toward completing the final phase of the compiler; numerous enhancements are needed to improve the capabilities of the parallelizing compiler.

  5. A simple and efficient explicit parallelization of logic programs using low-level threading primitives

    CERN Document Server

    Saha, Diptikalyan

    2009-01-01

    In this work, we present an automatic way to parallelize logic programs for finding all the answers to queries, using a transformation to low-level threading primitives. Although much work was done on the parallelization of logic programming more than a decade ago (e.g., Aurora, Muse, YapOR), the current state of parallelizing logic programs is still very poor. This work presents a way to parallelize tabled logic programs in XSB Prolog under the well-founded semantics. An important contribution of this work lies in merging answer tables from multiple child threads without incurring copying or full sharing and synchronization of data structures. The implementation of the parent-child shared answer tables surpasses in efficiency all the other data structures currently implemented for completion of answers in parallelization using multi-threading. The transformation and its lower-level answer-merging predicates were implemented as an extension to the XSB system.

  6. Overview of the Force Scientific Parallel Language

    Directory of Open Access Journals (Sweden)

    Gita Alaghband

    1994-01-01

    Full Text Available The Force parallel programming language, designed for large-scale shared-memory multiprocessors, is presented. The language provides a number of parallel constructs as extensions to ordinary Fortran and is implemented as a two-level macro preprocessor to support portability across shared-memory multiprocessors. The global parallelism model on which the Force is based provides a powerful parallel language. The parallel constructs, generic synchronization, and freedom from process management supported by the Force have resulted in structured parallel programs that have been ported to the many multiprocessors on which the Force is implemented. Two new parallel constructs for looping and functional decomposition are discussed. Several programming examples illustrating parallel programming approaches using the Force are also presented.

  7. General purpose parallel programming using new generation graphic processors: CPU vs GPU comparative analysis and opportunities research

    Directory of Open Access Journals (Sweden)

    Donatas Krušna

    2013-03-01

    Full Text Available OpenCL, a modern programming language for parallel heterogeneous systems, enables problems to be partitioned and executed on modern CPU and GPU hardware, which increases the performance of such applications considerably. Since GPUs are optimized for, and specialize in, floating-point and vector operations, they greatly outperform general-purpose CPUs in this field. The language greatly simplifies the creation of applications for such heterogeneous systems, since it is cross-platform, vendor-independent, and embeddable, and can hence be used from any other general-purpose programming language via libraries. More and more tools are being developed, aimed at low-level programmers as well as scientists and engineers, for building applications and libraries for today's CPUs and GPUs and other heterogeneous platforms. The tendency today is to increase the number of cores or CPUs in the hope of increasing performance; however, the growing difficulty of parallelizing applications for such systems and the ever-increasing overhead of communication and synchronization limit the achievable performance. This means there is a point at which adding cores or CPUs no longer increases an application's performance and can even diminish it. Even though parallel programming and GPUs with stream-computing capabilities have decreased the need for communication and synchronization (since only the final result needs to be committed to memory), this still remains a weak link in developing such applications.

  8. Declarative Parallel Programming in Spreadsheet End-User Development

    DEFF Research Database (Denmark)

    Biermann, Florian

    2016-01-01

    Spreadsheets are first-order functional languages and are widely used in research and industry as a tool to conveniently perform all kinds of computations. Because cells on a spreadsheet are immutable, there are possibilities for implicit parallelization of spreadsheet computations. In this liter…

  9. Machine and Collection Abstractions for User-Implemented Data-Parallel Programming

    Directory of Open Access Journals (Sweden)

    Magne Haveraaen

    2000-01-01

    Full Text Available Data parallelism has appeared as a fruitful approach to the parallelisation of compute-intensive programs. Data parallelism has the advantage of mimicking the sequential (and deterministic) structure of programs, as opposed to task parallelism, where the explicit interaction of processes has to be programmed. In data parallelism, data structures, typically collection classes in the form of large arrays, are distributed on the processors of the target parallel machine. Trying to extract distribution aspects from conventional code often runs into problems with a lack of uniformity in the use of the data structures and in the expression of data dependency patterns within the code. Here we propose a framework with two conceptual classes, Machine and Collection. The Machine class abstracts hardware communication and distribution properties. This gives a programmer high-level access to the important parts of the low-level architecture. The Machine class may readily be used in the implementation of a Collection class, giving the programmer full control of the parallel distribution of data, as well as allowing normal sequential implementation of this class. Any program using such a collection class will be parallelisable, without requiring any modification, by choosing between sequential and parallel versions at link time. Experiments with a commercial application, built using the Sophus library which uses this approach to parallelisation, show good parallel speed-ups, without any adaptation of the application program being needed.

  10. Parallel Engagement of Regions Associated with Encoding and Later Retrieval Forms Durable Memories

    NARCIS (Netherlands)

    Wagner, I.; Buuren, M. van; Bovy, L.; Fernandez, G.S.E.

    2016-01-01

    The fate of a memory is partly determined at initial encoding. However, the behavioral consequences of memory formation are often tested only once and shortly after learning, which leaves the neuronal predictors for the formation of durable memories largely unknown. Here, we hypothesized that durable…

  11. Debugging and Analysis of Large-Scale Parallel Programs

    Science.gov (United States)

    1989-09-01

    M maps a1 to v_j^m in t_m, M maps a2 to v_k^m in t_m, and j < k. 3. a1 and a2 both access the same memory location m, M maps both a1 and a2 to v_j^m in t_m, a1 is a… lock is released. In addition to being independent of a particular protocol, our synchronization tracing technique does not rely on a particular…

  12. Work stealing for GPU-accelerated parallel programs in a global address space framework: WORK STEALING ON GPU-ACCELERATED SYSTEMS

    Energy Technology Data Exchange (ETDEWEB)

    Arafat, Humayun [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA; Dinan, James [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Krishnamoorthy, Sriram [Computer Science and Mathematics Division, Pacific Northwest National Laboratory, Richland WA USA; Balaji, Pavan [Mathematics and Computer Science Division, Argonne National Laboratory, Lemont IL USA; Sadayappan, P. [Department of Computer Science and Engineering, The Ohio State University, Columbus OH USA

    2016-01-06

    Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.

  13. The Effect of Parallel Programming Languages on the Performance and Energy Consumption of HPC Applications

    Directory of Open Access Journals (Sweden)

    Muhammad Aqib

    2016-02-01

    Full Text Available Big and complex applications need many resources and long computation times when executed sequentially. In this scenario, all of an application's processes are handled in sequential fashion even if they are independent of each other. In a high-performance computing environment, multiple processors are available for running applications in parallel, so mutually independent blocks of code can run in parallel. This approach not only increases the efficiency of the system without affecting the results, but also saves a significant amount of energy. Many parallel programming models and APIs, such as Open MPI, OpenMP, and CUDA, are available for running multiple instructions in parallel. In this paper, the efficiency and energy consumption of two well-known tasks, matrix multiplication and quicksort, are analyzed using different parallel programming models and a multiprocessor machine. The obtained results, which can be generalized, outline the effect of the choice of programming model on efficiency and energy consumption when running different codes on different machines.
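
    For reference, a minimal OpenMP version of the matrix-multiplication task, one of several models the paper compares; the paper's exact benchmark code is not reproduced here.

      /* matmul_omp.c - the outer loop's iterations are independent, so a
         single directive distributes them across cores.
         Build (typical): cc -fopenmp matmul_omp.c */
      #include <stdio.h>
      #include <omp.h>
      #define N 256

      static double A[N][N], B[N][N], C[N][N];

      int main(void) {
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

          #pragma omp parallel for
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++) {
                  double sum = 0.0;
                  for (int k = 0; k < N; k++) sum += A[i][k] * B[k][j];
                  C[i][j] = sum;
              }

          printf("C[0][0] = %g (expected %g)\n", C[0][0], 2.0 * N);
          return 0;
      }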

  14. CRBLASTER: a fast parallel-processing program for cosmic ray rejection

    Science.gov (United States)

    Mighell, Kenneth J.

    2008-08-01

    Many astronomical image-analysis programs are based on algorithms that can be described as embarrassingly parallel, where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. A major reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called crblaster which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. crblaster is written in C using the industry-standard Message Passing Interface (MPI) library. Processing a single 800×800 HST WFPC2 image takes 1.87 seconds using 4 processes on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with 4 processes is 82%. The code can be used as a software framework for easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; the biggest required modification is the replacement of the core image-processing function with an alternative image-analysis function based on a single-processor algorithm. I describe the design, implementation, and performance of the program.
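
    A hedged sketch of the embarrassingly parallel structure described, with a placeholder per-pixel operation standing in for L.A.Cosmic: rank 0 scatters image strips, every rank processes its strip independently, and rank 0 gathers the results.

      /* subimage.c - strip-wise embarrassingly parallel image processing. */
      #include <mpi.h>
      #include <stdio.h>
      #include <stdlib.h>
      #define W 800
      #define H 800

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);
          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          int rows = H / size;                 /* assumes H % size == 0 */
          float *strip = malloc((size_t)rows * W * sizeof *strip);
          float *image = rank == 0 ? malloc((size_t)H * W * sizeof *image) : NULL;
          if (rank == 0) for (int i = 0; i < H * W; i++) image[i] = (float)i;

          MPI_Scatter(image, rows * W, MPI_FLOAT,
                      strip, rows * W, MPI_FLOAT, 0, MPI_COMM_WORLD);
          for (int i = 0; i < rows * W; i++)   /* independent subimage work */
              strip[i] = strip[i] * 0.5f;      /* placeholder, not L.A.Cosmic */
          MPI_Gather(strip, rows * W, MPI_FLOAT,
                     image, rows * W, MPI_FLOAT, 0, MPI_COMM_WORLD);

          if (rank == 0) printf("done: image[0] = %g\n", image[0]);
          free(strip); free(image);
          MPI_Finalize();
          return 0;
      }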

  15. Training to Enhance Adult Memory (TEAM): an investigation of the effectiveness of a memory training program with older adults.

    Science.gov (United States)

    Fairchild, J Kaci; Scogin, F R

    2010-04-01

    Prior research examining the effectiveness of memory enhancement programs targeting both objective and subjective memory has yielded results with varying degrees of success. The current investigation aimed to contribute to the present body of memory training literature through the evaluation of an in-home memory enhancement program for older adults. Fifty-three community-dwelling older adults were assigned to either a memory enhancement condition or a minimal social support condition. Those in the memory enhancement condition had significant improvement in remembering names with faces and not misplacing household objects. Additionally, those in the memory enhancement condition also reported being more content with their memory, having fewer lapses in memory, greater use of mnemonic strategies, and were less bothered by memory complaints. Regression analyses indicated that neither levels of positive nor negative affect were predictive of participants' objective and subjective memory at post-treatment. Results of these analyses provide support for the use of memory enhancement programs to improve older adults' ability to keep track of items, remember names and faces, and to also feel better about their memory ability.

  16. Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

    Energy Technology Data Exchange (ETDEWEB)

    Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal; Kulkarni, Milind

    2017-01-26

    Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task and data parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.

  17. PSEE: A Tool for Parallel Systems Learning

    OpenAIRE

    E. Luque; R. Suppi; Sorribes, J.; E. Cesar; J. Falguera; Serrano, M.

    2012-01-01

    Programs for distributed-memory parallel computers are difficult to write, understand, evaluate, and debug, and the design and performance evaluation of parallel algorithms is much more complex than for conventional sequential ones. The technical know-how necessary for the implementation of parallel systems is already available, but a critical problem is the handling of complexity. In parallel distributed-memory systems the performance is highly influenced by factors such as the interconnection scheme, granula...

  18. Initial feasibility and validity of a prospective memory training program in a substance use treatment population.

    Science.gov (United States)

    Sweeney, Mary M; Rass, Olga; Johnson, Patrick S; Strain, Eric C; Berry, Meredith S; Vo, Hoa T; Fishman, Marc J; Munro, Cynthia A; Rebok, George W; Mintzer, Miriam Z; Johnson, Matthew W

    2016-10-01

    Individuals with substance use disorders have shown deficits in the ability to implement future intentions, called prospective memory. Deficits in prospective memory and working memory, a critical underlying component of prospective memory, likely contribute to substance use treatment failures. Thus, improvement of prospective memory and working memory in substance use patients is an innovative target for intervention. We sought to develop a feasible and valid prospective memory training program that incorporates working memory training and may serve as a useful adjunct to substance use disorder treatment. We administered a single session of the novel prospective memory and working memory training program to participants (n = 22; 13 men, 9 women) enrolled in outpatient substance use disorder treatment and correlated performance to existing measures of prospective memory and working memory. Generally accurate prospective memory performance in a single session suggests feasibility in a substance use treatment population. However, training difficulty should be increased to avoid ceiling effects across repeated sessions. Consistent with existing literature, we observed superior performance on event-based relative to time-based prospective memory tasks. Performance on the prospective memory and working memory training components correlated with validated assessments of prospective memory and working memory, respectively. Correlations between novel memory training program performance and established measures suggest that our training engages appropriate cognitive processes. Further, differential event- and time-based prospective memory task performance suggests internal validity of our training. These data support the development of this intervention as an adjunctive therapy for substance use disorders.

  19. Time complexity analysis for distributed memory computers: implementation of parallel conjugate gradient method

    NARCIS (Netherlands)

    Hoekstra, A.G.; Sloot, P.M.A.; Haan, M.J.; Hertzberger, L.O.; van Leeuwen, J.

    1991-01-01

    New developments in computer science, both hardware and software, offer researchers, such as physicists, unprecedented possibilities to solve their computationally intensive problems. However, full exploitation of, e.g., new massively parallel computers, parallel languages or runtime environments require…

  20. Exploration of Deep Learning Algorithms Using the OpenACC Parallel Programming Model

    KAUST Repository

    Hamam, Alwaleed A.

    2017-03-13

    Deep learning is based on a set of algorithms that attempt to model high-level abstractions in data. Specifically, RBM is a deep learning algorithm whose time performance this project increases through an efficient parallel implementation with the OpenACC tool, applying the best possible optimizations to RBM to harness the massively parallel power of NVIDIA GPUs. GPU development in the last few years has contributed to the growth of deep learning. OpenACC is a directive-based approach to computing, where directives provide compiler hints to accelerate code. The traditional Restricted Boltzmann Machine is a stochastic neural network that essentially performs a binary version of factor analysis. RBM is a useful neural-network basis for larger modern deep learning models, such as the Deep Belief Network. RBM parameters are estimated using an efficient training method called Contrastive Divergence. Parallel implementations of RBM are available using different models such as OpenMP and CUDA, but this project is the first attempt to apply the OpenACC model to RBM.
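
    For illustration, a minimal OpenACC-annotated C loop in the directive style described (a generic SAXPY, not the RBM/Contrastive Divergence code itself).

      /* saxpy_acc.c - the pragma asks the compiler to offload the loop to an
         accelerator, copying data in and out as annotated.
         Build with an OpenACC compiler, e.g.: nvc -acc saxpy_acc.c */
      #include <stdio.h>
      #define N 1000000

      int main(void) {
          static float x[N], y[N];
          for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

          #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
          for (int i = 0; i < N; i++)
              y[i] = 2.0f * x[i] + y[i];      /* independent iterations */

          printf("y[0] = %g\n", y[0]);        /* 4 */
          return 0;
      }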

  1. Computer simulation program for parallel SITAN. [Sandia Inertial Terrain-Aided Navigation, in FORTRAN]

    Energy Technology Data Exchange (ETDEWEB)

    Andreas, R.D.; Sheives, T.C.

    1980-11-01

    This computer program simulates the operation of parallel SITAN using digitized terrain data. An actual trajectory is modeled including the effects of inertial navigation errors and radar altimeter measurements.

  2. F-Nets and Software Cabling: Deriving a Formal Model and Language for Portable Parallel Programming

    Science.gov (United States)

    DiNucci, David C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    Parallel programming is still based upon antiquated sequence-based definitions of the terms "algorithm" and "computation", resulting in programs which are architecture-dependent and difficult to design and analyze. By focusing on obstacles inherent in existing practice, a more portable model is derived here, which is then formalized into a model called Soviets which utilizes a combination of imperative and functional styles. This formalization suggests more general notions of algorithm and computation, as well as insights into the meaning of structured programming in a parallel setting. To illustrate how these principles can be applied, a very-high-level graphical architecture-independent parallel language, called Software Cabling, is described, with many of the features normally expected from today's computer languages (e.g. data abstraction, data parallelism, and object-based programming constructs).

  3. Design of GBSB neural associative memories using semidefinite programming.

    Science.gov (United States)

    Park, J; Cho, H; Park, D

    1999-01-01

    This paper concerns reliable search for the optimally performing GBSB (generalized brain-state-in-a-box) neural associative memory given a set of prototype patterns to be stored as stable equilibrium points. First, we observe some new qualitative properties of the GBSB model. Next, we formulate the synthesis of GBSB neural associative memories as a constrained optimization problem. Finally, we convert the optimization problem into a semidefinite program (SDP), which can be solved efficiently by recently developed interior point methods. The validity of this approach is illustrated by a design example.

  4. Resolutions of the Coulomb operator: VIII. Parallel implementation using the modern programming language X10.

    Science.gov (United States)

    Limpanuparb, Taweetham; Milthorpe, Josh; Rendell, Alistair P

    2014-10-30

    Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. By using X10, a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner, including use of both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of integral calculation using X10's work-stealing runtime, and report performance results for a long-range HF energy calculation of a large molecule with a high-quality basis set running on up to 1024 cores of a high-performance cluster machine.

  5. Hybrid Parallel Programming Models for AMR Neutron Monte-Carlo Transport

    Science.gov (United States)

    Dureau, David; Poëtte, Gaël

    2014-06-01

    This paper deals with High Performance Computing (HPC) applied to neutron transport theory on complex geometries, thanks to both an Adaptive Mesh Refinement (AMR) algorithm and a Monte-Carlo (MC) solver. Several parallelism models are presented and analyzed in this context, among them shared memory and distributed memory ones, such as Domain Replication and Domain Decomposition, together with hybrid strategies. The study is illustrated by weak and strong scalability tests on complex benchmarks on several thousand cores, thanks to the petaflop supercomputer Tera100.
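
    As a concrete (and heavily simplified) illustration of the hybrid strategies the abstract mentions, the following sketch combines MPI across nodes with OpenMP threads inside a node in a domain-replication style: every rank scores an independent batch of toy Monte-Carlo samples and a final reduction merges the tallies. It is an assumption-laden stand-in, not the paper's solver; all names and counts are invented.

    /* compile with: mpicc -fopenmp hybrid.c -o hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NPART 1000000   /* total particle histories (assumed) */

    int main(int argc, char **argv) {
        int provided, rank, size;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        long local = NPART / size, hits = 0;

        /* OpenMP threads share the rank's replica of the (toy) domain
           and split its batch of particle histories */
        #pragma omp parallel for reduction(+:hits)
        for (long i = 0; i < local; i++) {
            unsigned int seed = (unsigned)(rank * 1000003u + i); /* POSIX rand_r */
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y < 1.0) hits++;   /* toy "collision" tally */
        }

        long total;
        MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pi ~ %f\n", 4.0 * total / (double)(local * size));
        MPI_Finalize();
        return 0;
    }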

  6. Molecular Programming of Immunological Memory in Natural Killer Cells.

    Science.gov (United States)

    Beaulieu, Aimee M; Madera, Sharline; Sun, Joseph C

    2015-01-01

    Immunological memory is a hallmark of the adaptive immune system. Although natural killer (NK) cells have traditionally been classified as a component of the innate immune system, they have recently been shown in mice and humans to exhibit certain features of immunological memory, including an ability to undergo a clonal-like expansion during virus infection, generate long-lived progeny (i.e. memory cells), and mediate recall responses against previously encountered pathogens--all characteristics previously ascribed only to adaptive immune responses by B and T cells in mammals. To date, the molecular events that govern the generation of NK cell memory are not completely understood. Using a mouse model of cytomegalovirus infection, we demonstrate that the individual pro-inflammatory IL-12, IL-18, and type I-IFN signaling pathways are indispensable and play non-redundant roles in the generation of virus-specific NK cell memory. Furthermore, we discovered that antigen-specific proliferation and protection by NK cells is mediated by the transcription factor Zbtb32, which is induced by pro-inflammatory cytokines and promotes a cell cycle program in activated NK cells. A greater understanding of the molecular mechanisms controlling NK cell responses will provide novel strategies for tailoring vaccines to target infectious disease.

  7. Repeated Stimulation of Cultured Networks of Rat Cortical Neurons Induces Parallel Memory Traces

    Science.gov (United States)

    le Feber, Joost; Witteveen, Tim; van Veenendaal, Tamar M.; Dijkstra, Jelle

    2015-01-01

    During systems consolidation, memories are spontaneously replayed favoring information transfer from hippocampus to neocortex. However, at present no empirically supported mechanism to accomplish a transfer of memory from hippocampal to extra-hippocampal sites has been offered. We used cultured neuronal networks on multielectrode arrays and…

  8. On the Performance of the Python Programming Language for Serial and Parallel Scientific Computations

    Directory of Open Access Journals (Sweden)

    Xing Cai

    2005-01-01

    Full Text Available This article addresses the performance of scientific applications that use the Python programming language. First, we investigate several techniques for improving the computational efficiency of serial Python codes. Then, we discuss the basic programming techniques in Python for parallelizing serial scientific applications. It is shown that an efficient implementation of the array-related operations is essential for achieving good parallel performance, as for the serial case. Once the array-related operations are efficiently implemented, probably using a mixed-language implementation, good serial and parallel performance become achievable. This is confirmed by a set of numerical experiments. Python is also shown to be well suited for writing high-level parallel programs.

  9. Concurrent extensions to the FORTRAN language for parallel programming of computational fluid dynamics algorithms

    Science.gov (United States)

    Weeks, Cindy Lou

    1986-01-01

    Experiments were conducted at NASA Ames Research Center to define multi-tasking software requirements for multiple-instruction, multiple-data stream (MIMD) computer architectures. The focus was on specifying solutions for algorithms in the field of computational fluid dynamics (CFD). The program objectives were to allow researchers to produce usable parallel application software as soon as possible after acquiring MIMD computer equipment, to provide researchers with an easy-to-learn and easy-to-use parallel software language which could be implemented on several different MIMD machines, and to enable researchers to list preferred design specifications for future MIMD computer architectures. Analysis of CFD algorithms indicated that extensions of an existing programming language, adaptable to new computer architectures, provided the best solution to meeting program objectives. The CoFORTRAN Language was written in response to these objectives and to provide researchers a means to experiment with parallel software solutions to CFD algorithms on machines with parallel architectures.

  10. Waking Up Buried Memories of Old TV Programs

    Science.gov (United States)

    Larzabal, Christelle; Bacon-Macé, Nadège; Muratot, Sophie; Thorpe, Simon J.

    2017-01-01

    Although it has been demonstrated that visual and auditory stimuli can be recalled decades after the initial exposure, previous studies have generally not ruled out the possibility that the material may have been seen or heard during the intervening period. Evidence shows that reactivations of a long-term memory trace play a role in its update and maintenance. In the case of remote or very long-term memories, it is most likely that these reactivations are triggered by the actual re-exposure to the stimulus. In this study we decided to explore whether it is possible to recall stimuli that could not have been re-experienced in the intervening period. We tested the ability of French participants (N = 34, 31 female) to recall 50 TV programs broadcast on average for the last time 44 years ago (from the 60's and early 70's). Potential recall was elicited by the presentation of short audiovisual excerpts of these TV programs. The absence of potential re-exposure to the material was strictly controlled by selecting TV programs that have never been rebroadcast and were not available in the public domain. Our results show that six TV programs were particularly well identified on average across the 34 participants with a median percentage of 71.7% (SD = 13.6, range: 48.5–87.9%). We also obtained 50 single case reports with associated information about the viewing of 23 TV programs including the 6 previous ones. More strikingly, for two cases, retrieval of the title was made spontaneously without the need of a four-proposition choice. These results suggest that re-exposures to the stimuli are not necessary to maintain a memory for a lifetime. These new findings raise fundamental questions about the underlying mechanisms used by the brain to store these very old sensory memories. PMID:28443005

  11. Parallel Libraries to support High-Level Programming

    DEFF Research Database (Denmark)

    Larsen, Morten Nørgaard

    …model requires the programmer to think a bit differently, but at the same time the implemented algorithms will perform very well, as shown by the initial tests presented. In the second part of this thesis, I will change focus from the CELL-BE architecture to the more traditional x86 architecture… of the more exotic though short-lived heterogeneous CELL Broadband Engine (CELL-BE) architecture added to this shift. Furthermore, the use of cluster computers made of commodity hardware and specialized supercomputers has greatly increased in both industry and the academic world. Finally… as they would be a single machine. In between is a number of tools helping the programmers handle communication, share data, run loops in parallel, handle algorithms mining huge amounts of data, etc. Even though most of them do a good job performance-wise, almost all of them require that the programmers learn…

  12. Optical interconnection network for parallel access to multi-rank memory in future computing systems.

    Science.gov (United States)

    Wang, Kang; Gu, Huaxi; Yang, Yintang; Wang, Kun

    2015-08-10

    With the number of cores increasing, there is an emerging need for a high-bandwidth low-latency interconnection network, serving core-to-memory communication. In this paper, aiming at the goal of simultaneous access to multi-rank memory, we propose an optical interconnection network for core-to-memory communication. In the proposed network, the wavelength usage is delicately arranged so that cores can communicate with different ranks at the same time and broadcast for flow control can be achieved. A distributed memory controller architecture that works in a pipeline mode is also designed for efficient optical communication and transaction address processes. The scaling method and wavelength assignment for the proposed network are investigated. Compared with traditional electronic bus-based core-to-memory communication, the simulation results based on the PARSEC benchmark show that the bandwidth enhancement and latency reduction are apparent.

  13. An efficient parallel algorithm for O(N^2) direct summation method and its variations on distributed-memory parallel machines

    CERN Document Server

    Makino, J

    2001-01-01

    We present a novel, highly efficient algorithm to parallelize the O(N^2) direct summation method for N-body problems with individual timesteps on distributed-memory parallel machines such as Beowulf clusters. Previously known algorithms, in which all processors have complete copies of the N-body system, have the serious problem that the communication-computation ratio increases as we increase the number of processors, since the communication cost is independent of the number of processors. In the new algorithm, p processors are organized as a $\sqrt{p}\times\sqrt{p}$ two-dimensional array. Each processor has $N/\sqrt{p}$ particles, but the data are distributed in such a way that the complete system is presented if we look at any row or column consisting of $\sqrt{p}$ processors. In this algorithm, the communication cost scales as $N/\sqrt{p}$, while the calculation cost scales as $N^2/p$. Thus, we can use a much larger number of processors without losing efficiency compared to what was practical with previously known…
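
    The following sketch is a heavily simplified, assumption-laden rendering of the two-dimensional layout described above (toy 1D positions only, not Makino's code): p = q x q ranks are arranged so that rank (r,c) computes the partial forces of source group c on target group r, and the q partial results for a target group are then summed along a grid row, so per-rank communication scales as N/sqrt(p) while computation scales as N^2/p.

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>

    #define NG 4   /* bodies per group, n = N/q (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, p;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        int q = (int)lround(sqrt((double)p)); /* grid side; assume q*q == p */
        int r = rank / q, c = rank % q;

        /* ranks that share target group r form one row of the grid */
        MPI_Comm row;
        MPI_Comm_split(MPI_COMM_WORLD, r, c, &row);

        /* toy 1D positions, fabricated locally for brevity; in the real
           algorithm they arrive via O(N/sqrt(p)) row/column broadcasts */
        double xi[NG], xj[NG];
        for (int i = 0; i < NG; i++) { xi[i] = r * NG + i; xj[i] = c * NG + i; }

        /* partial forces of source group c on target group r:
           (N/q)^2 = N^2/p pair interactions per rank */
        double fpart[NG] = {0}, f[NG];
        for (int i = 0; i < NG; i++)
            for (int j = 0; j < NG; j++) {
                double d = xj[j] - xi[i];
                if (d != 0.0) fpart[i] += (d > 0 ? 1.0 : -1.0) / (d * d);
            }

        /* combine the q partial results for group r along the row */
        MPI_Allreduce(fpart, f, NG, MPI_DOUBLE, MPI_SUM, row);

        if (rank == 0) printf("force on body 0 of group 0: %g\n", f[0]);
        MPI_Comm_free(&row);
        MPI_Finalize();
        return 0;
    }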

  14. Protocol-Based Verification of Message-Passing Parallel Programs

    DEFF Research Database (Denmark)

    López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco

    2015-01-01

    …translated into a representation read by VCC, a software verifier for C. We successfully verified several MPI programs in a running time that is independent of the number of processes or other input parameters. This contrasts with alternative techniques, notably model checking and runtime verification…

  15. Guide to development of a scalar massive parallel programming on Paragon

    Energy Technology Data Exchange (ETDEWEB)

    Ueshima, Yutaka; Arakawa, Takuya; Sasaki, Akira [Japan Atomic Energy Research Inst., Neyagawa, Osaka (Japan). Kansai Research Establishment; Yokota, Hisasi

    1998-10-01

    Parallel calculations using more than a hundred computers began in Japan only several years ago. The Intel Paragon XP/S 15GP256 and 75MP834 were introduced as pioneers at the Japan Atomic Energy Research Institute (JAERI) to pursue massive parallel simulations for advanced photon and fusion research. Recently, a large number of parallel programs have been transplanted or newly produced to perform parallel calculations on those computers. However, these programs were developed based on software technologies for conventional supercomputers, and therefore they sometimes cause trouble in massive parallel computing. In principle, when programs are developed under a different computer and operating system (OS), prudent directions and knowledge are needed. However, integration of knowledge and standardization of the environment are quite difficult because the number of Paragon systems and Paragon users in Japan is very small. Therefore, we summarized information obtained through the process of developing a massive parallel program on the Paragon XP/S 75MP834. (author)

  16. Parallelized Solution to Semidefinite Programmings in Quantum Complexity Theory

    CERN Document Server

    Wu, Xiaodi

    2010-01-01

    In this paper we present an equilibrium-value-based framework for solving SDPs via the multiplicative weight update method, which differs from the one in Kale's thesis [Kale07]. One of the main advantages of the new framework is that we can guarantee convertibility from approximate to exact feasibility for a much more general class of SDPs than previous results. Another advantage is that the design of the oracle, which is necessary for applying the multiplicative weight update method, is much simplified in general cases. This leads to alternative and easier solutions to the SDPs used in the previous results QIP(2) ⊆ PSPACE [JainUW09] and QMAM = PSPACE [JainJUW09]. Furthermore, we provide a generic form of SDPs which can be solved in a similar way. By parallelizing every step in our solution, we are able to solve a class of SDPs in NC. Although our motivation is from quantum computing, our result will also apply directly to any SDP which satisfies…

  17. Programming a massively parallel, computation universal system: static behavior

    Energy Technology Data Exchange (ETDEWEB)

    Lapedes, A.; Farber, R.

    1986-01-01

    In previous work by the authors, the "optimum finding" properties of Hopfield neural nets were applied to the nets themselves to create a "neural compiler." This was done in such a way that the problem of programming the attractors of one neural net (called the Slave net) was expressed as an optimization problem that was in turn solved by a second neural net (the Master net). In this series of papers that approach is extended to programming nets that contain interneurons (sometimes called "hidden neurons"), and thus deals with nets capable of universal computation. 22 refs.

  18. Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs

    Science.gov (United States)

    2008-01-01

    …follows. In Sections 2 and 3, we present precise definitions for distributed programs, specifications, and fault-tolerance. We formally state the… Subsequently, experimental results and analysis are presented in Section 6. Related work is discussed in Section 7. Finally, we conclude in Section… infinite computation by stuttering at sl. On the other hand, if there exists a state sd such that there is no outgoing transition (or a self-loop…

  19. Concurrent Programming Using Actors: Exploiting Large-Scale Parallelism,

    Science.gov (United States)

    1985-10-07

    Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 545 Technology Square, Cambridge, MA; G. Agha et al. (Report documentation page; no abstract available.)

  20. The differential hippocampal phosphoproteome of Apodemus sylvaticus paralleling spatial memory retrieval in the Barnes maze.

    Science.gov (United States)

    Li, Lin; Csaszar, Edina; Szodorai, Edit; Patil, Sudarshan; Pollak, Arnold; Lubec, Gert

    2014-05-01

    Protein phosphorylation is a well-known and well-documented mechanism in memory processes. Although a large series of protein kinases involved in memory processes have been reported, information on phosphoproteins is limited. It was therefore the aim of the study to determine a partial and differential phosphoproteome, along with the corresponding network, in the hippocampus of a wild-caught mouse strain with excellent performance in several paradigms of spatial memory. Apodemus sylvaticus mice were trained in the Barnes maze, a non-invasive test system for spatial memory, and untrained mice served as controls. Animals were sacrificed 6 h following memory retrieval, hippocampi were taken, proteins were extracted, and in-solution digestion was carried out with subsequent iTRAQ double labelling. Phosphopeptides were enriched by a TiO2-based method and semi-quantified using two fragmentation principles on the LTQ-Orbitrap Velos. In hippocampi of trained animals, phosphopeptide levels representing signalling, neuronal, synaptosomal, cytoskeletal and metabolism proteins were at least twofold reduced or increased. Furthermore, a network revealing a link to pathways of ubiquitination, the androgen receptor, the small GTPase Rab5 and MAPK signaling, as well as synucleins, was constructed. This work is relevant for the interpretation of previous work and the design of future studies on protein phosphorylation in spatial memory.

  1. Describing, using 'recognition cones'. [parallel-series model with English-like computer program

    Science.gov (United States)

    Uhr, L.

    1973-01-01

    A parallel-serial 'recognition cone' model is examined, taking into account the model's ability to describe scenes of objects. An actual program is presented in an English-like language. The concept of a 'description' is discussed together with possible types of descriptive information. Questions regarding the level and the variety of detail are considered along with approaches for improving the serial representations of parallel systems.

  2. Method for resource control in parallel environments using program organization and run-time support

    Science.gov (United States)

    Ekanadham, Kattamuri (Inventor); Moreira, Jose Eduardo (Inventor); Naik, Vijay Krishnarao (Inventor)

    2001-01-01

    A system and method for dynamic scheduling and allocation of resources to parallel applications during the course of their execution. By establishing well-defined interactions between an executing job and the parallel system, the system and method support dynamic reconfiguration of processor partitions, dynamic distribution and redistribution of data, communication among cooperating applications, and various other monitoring actions. The interactions occur only at specific points in the execution of the program where the aforementioned operations can be performed efficiently.

  3. Task scheduling of parallel programs to optimize communications for cluster of SMPs

    Institute of Scientific and Technical Information of China (English)

    郑纬民; 杨博; 林伟坚; 李志光

    2001-01-01

    This paper discusses compile-time task scheduling of parallel programs running on clusters of SMP workstations. Firstly, the problem is stated formally, transformed into a graph partition problem, and proved to be NP-complete. A heuristic algorithm, MMP-Solver, is then proposed to solve the problem. Experimental results show that the task scheduling can greatly reduce the communication overhead of parallel applications and that MMP-Solver outperforms the existing algorithms.

  4. Efficient Parallelization of the Stochastic Dual Dynamic Programming Algorithm Applied to Hydropower Scheduling

    Directory of Open Access Journals (Sweden)

    Arild Helseth

    2015-12-01

    Full Text Available Stochastic dual dynamic programming (SDDP has become a popular algorithm used in practical long-term scheduling of hydropower systems. The SDDP algorithm is computationally demanding, but can be designed to take advantage of parallel processing. This paper presents a novel parallel scheme for the SDDP algorithm, where the stage-wise synchronization point traditionally used in the backward iteration of the SDDP algorithm is partially relaxed. The proposed scheme was tested on a realistic model of a Norwegian water course, proving that the synchronization point relaxation significantly improves parallel efficiency.

  5. Parallel Programming Application to Matrix Algebra in the Spectral Method for Control Systems Analysis, Synthesis and Identification

    Directory of Open Access Journals (Sweden)

    V. Yu. Kleshnin

    2016-01-01

    Full Text Available The article describes the matrix algebra libraries based on modern parallel programming technologies for the Spectrum software, which can use a spectral method (in the spectral form of mathematical description) to analyse, synthesise and identify deterministic and stochastic dynamical systems. The developed matrix algebra libraries use the following technologies: for CPUs, OmniThreadLibrary, OpenMP, Intel Threading Building Blocks and Intel Cilk Plus; for GPUs, nVidia CUDA, OpenCL and Microsoft Accelerated Massive Parallelism. The developed libraries support matrices with real elements (single and double precision). The matrix dimensions are limited only by the 32-bit or 64-bit memory model and the computer configuration. These libraries are general-purpose and can be used not only for the Spectrum software; they can also find application in other projects where there is a need to perform operations with large matrices. The article provides a comparative analysis of the libraries developed for various matrix operations (addition, subtraction, scalar multiplication, multiplication, powers of matrices, tensor multiplication, transpose, inverse matrix, finding a solution of a system of linear equations) through numerical experiments using different CPUs and GPUs. The article contains sample programs and performance test results for matrix multiplication, which requires the most computational resources of all the operations considered.

  6. CRBLASTER: A Fast Parallel-Processing Program for Cosmic Ray Rejection in Space-Based Observations

    Science.gov (United States)

    Mighell, K.

    Many astronomical image analysis tasks are based on algorithms that can be described as being embarrassingly parallel - where the analysis of one subimage generally does not affect the analysis of another subimage. Yet few parallel-processing astrophysical image-analysis programs exist that can easily take full advantage of today's fast multi-core servers costing a few thousand dollars. One reason for the shortage of state-of-the-art parallel-processing astrophysical image-analysis codes is that the writing of parallel codes has been perceived to be difficult. I describe a new fast parallel-processing image-analysis program called CRBLASTER which does cosmic ray rejection using van Dokkum's L.A.Cosmic algorithm. CRBLASTER is written in C using the industry-standard Message Passing Interface library. Processing a single 800 x 800 Hubble Space Telescope Wide-Field Planetary Camera 2 (WFPC2) image takes 1.9 seconds using 4 processors on an Apple Xserve with two dual-core 3.0-GHz Intel Xeons; the efficiency of the program running with the 4 cores is 82%. The code has been designed to be used as a software framework for the easy development of parallel-processing image-analysis programs using embarrassingly parallel algorithms; all that needs to be done is to replace the core image processing task (in this case the C function that performs the L.A.Cosmic algorithm) with an alternative image analysis task based on a single-processor algorithm. I describe the design and implementation of the program and then discuss how it could possibly be used to quickly do time-critical analysis applications such as those involved with space surveillance or do complex calibration tasks as part of the pipeline processing of images from large focal plane arrays.
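
    A minimal sketch of the master/worker pattern described above, under stated assumptions (it is not CRBLASTER itself): the master scatters an image to the ranks as row strips, each rank runs a stand-in per-pixel task independently (the real program runs L.A.Cosmic, which in practice also needs halo rows), and the results are gathered back.

    /* compile with: mpicc tiles.c -o tiles */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define W 800
    #define H 800   /* e.g. one WFPC2 frame */

    /* stand-in for the per-pixel kernel (the real code runs L.A.Cosmic) */
    static void process_strip(float *px, int n) {
        for (int i = 0; i < n; i++)
            if (px[i] > 100.0f) px[i] = 0.0f;   /* toy "rejection" */
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int rows = H / size;   /* assume size divides H */
        int n = rows * W;

        float *img = NULL;
        if (rank == 0) {       /* master holds the full frame */
            img = malloc((size_t)W * H * sizeof *img);
            for (int i = 0; i < W * H; i++) img[i] = (float)(i % 200);
        }
        float *strip = malloc((size_t)n * sizeof *strip);

        MPI_Scatter(img, n, MPI_FLOAT, strip, n, MPI_FLOAT, 0, MPI_COMM_WORLD);
        process_strip(strip, n);   /* no inter-rank communication needed */
        MPI_Gather(strip, n, MPI_FLOAT, img, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

        if (rank == 0) { printf("done: %f\n", img[0]); free(img); }
        free(strip);
        MPI_Finalize();
        return 0;
    }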

  7. Multi-core and Many-core Shared-memory Parallel Raycasting Volume Rendering Optimization and Tuning

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark

    2012-01-31

    Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. And, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.

  8. Concurrent Collections (CnC): A new approach to parallel programming

    CERN Document Server

    CERN. Geneva

    2010-01-01

    A common approach in designing parallel languages is to provide some high level handles to manipulate the use of the parallel platform. This exposes some aspects of the target platform, for example, shared vs. distributed memory. It may expose some but not all types of parallelism, for example, data parallelism but not task parallelism. This approach must find a balance between the desire to provide a simple view for the domain expert and provide sufficient power for tuning. This is hard for any given architecture and harder if the language is to apply to a range of architectures. Either simplicity or power is lost. Instead of viewing the language design problem as one of providing the programmer with high level handles, we view the problem as one of designing an interface. On one side of this interface is the programmer (domain expert) who knows the application but needs no knowledge of any aspects of the platform. On the other side of the interface is the performance expert (programmer o...

  9. Memory enhancement program for community-based older adults: development and evaluation.

    Science.gov (United States)

    Caprio-Prevette, M D; Fry, P S

    1996-01-01

    The purpose of the study was to develop a multifactorial memory enhancement program for community-dwelling older adults aimed at encouraging positive beliefs and behaviors about memory function and abilities in later life. The study evaluated the effectiveness of cognitive restructuring techniques (56 subjects) as compared to traditional memory training techniques (61 subjects) for purposes of enhancing memory performance. Posttest assessments were conducted after 10 weeks of memory training. Follow-up assessments were conducted 9 weeks later to assess maintenance of memory performance and memory beliefs. Three 2 x 3 (Treatment x Time) repeated measures multivariate analyses of variance were conducted to evaluate the effects of two types of intervention on memory performance, memory perception, and affective symptomatology over time. Results suggest that cognitive restructuring techniques may help community-dwelling older adults gain control over their beliefs about memory and thereby enhance their memory performance.

  10. Parallel Stories, Differentiated Histories. Exploring Fiction and Memory in Spanish and Portuguese Television

    Directory of Open Access Journals (Sweden)

    José Carlos Rueda Laffond

    2013-06-01

    Full Text Available Integrated into an international project on the characteristics of historical fiction on TV in Spain and Portugal during 2001–2012, the study traces the main aspects of these productions as entertainment products and memory strategies. Historical fiction on Iberian television channels presents qualitative problems of interpretation. Its development must be related to issues such as practices, meanings and forms of recognition, and connected with specific memory systems. The article explores a set of key points: uses and topics of historical fiction; its visions through similar proposals; polarization in several historical times; and its convergent perspectives on Franco and Oliveira Salazar as Iberian contemporary dictators.

  11. High performance parallelism pearls 2 multicore and many-core programming approaches

    CERN Document Server

    Jeffers, Jim

    2015-01-01

    High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming - illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as biomed, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work includes detailed explanations of t

  12. Empirical valence bond models for reactive potential energy surfaces: a parallel multilevel genetic program approach.

    Science.gov (United States)

    Bellucci, Michael A; Coker, David F

    2011-07-28

    We describe a new method for constructing empirical valence bond potential energy surfaces using a parallel multilevel genetic program (PMLGP). Genetic programs can be used to perform an efficient search through function space and parameter space to find the best functions and sets of parameters that fit energies obtained by ab initio electronic structure calculations. Building on the traditional genetic program approach, the PMLGP utilizes a hierarchy of genetic programming on two different levels. The lower-level genetic programs are used to optimize coevolving populations in parallel, while the higher-level genetic program (HLGP) is used to optimize the genetic operator probabilities of the lower-level genetic programs. The HLGP allows the algorithm to dynamically learn the mutation or combination of mutations that most effectively increases the fitness of the populations, causing a significant increase in the algorithm's accuracy and efficiency. The algorithm's accuracy and efficiency are tested against a standard parallel genetic program with a variety of one-dimensional test cases. Subsequently, the PMLGP is utilized to obtain an accurate empirical valence bond model for proton transfer in 3-hydroxy-gamma-pyrone in gas phase and protic solvent.

  13. Parallel Stories, Differentiated Histories. Exploring Fiction and Memory in Spanish and Portuguese Television

    NARCIS (Netherlands)

    Rueda Laffond, José Carlos; Coronado Ruiz, Carlota; Duff Burnay, Catarina; Díaz Pérez, Susana; Guerra Gómez, Amparo; Santos, Rogério

    2013-01-01

    Integrated into an international project on the characteristics of historical fiction on TV in Spain and Portugal during 2001–2012, the study traces the main aspects of these productions as entertainment products and memory strategies. Historical fiction on Iberian television channels presents qualitative problems of interpretation…

  15. Grid Service Framework:Supporting Multi-Models Parallel Grid Programming

    Institute of Scientific and Technical Information of China (English)

    邓倩妮; 陆鑫达

    2004-01-01

    Web service is a grid computing technology that promises greater ease-of-use and interoperability than previous distributed computing technologies. This paper proposes the Group Service Framework, a grid computing platform based on Microsoft .NET that uses web services to: (1) locate and harness volunteer computing resources for different applications; (2) support multiple models, such as Master/Slave, Divide and Conquer, and Phase Parallel, as parallel programming paradigms in a Grid environment; and (3) allocate data and balance load dynamically and transparently for grid computing applications. The Group Service Framework based on Microsoft .NET was used to implement several simple parallel computing applications. The results show that the proposed Group Service Framework is suitable for generic parallel numerical computing.

  16. Generating local addresses and communication sets for data-parallel programs

    Science.gov (United States)

    Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

    1993-01-01

    Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance FORTRAN. We show that, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution and a computation involving the regular section A(l:h:s), the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little run-time overhead and acceptable preprocessing time.
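
    As a worked illustration of the ownership arithmetic behind those state machines, the brute-force sketch below (an assumption-laden stand-in, not the paper's algorithm) enumerates the local access sequence that the k-state machine would generate incrementally: under a cyclic(k) distribution over P processors, element A(g) lives on processor (g/k) mod P at local address (g/(kP))*k + g mod k.

    #include <stdio.h>

    int main(void) {
        int P = 4, k = 3;           /* processors, block size (assumed) */
        int l = 0, h = 100, s = 7;  /* regular section A(l:h:s)         */
        int me = 2;                 /* processor of interest            */

        for (int g = l; g <= h; g += s) {
            int block = g / k;
            if (block % P == me)    /* element owned by processor `me`? */
                printf("global %3d -> local %3d\n",
                       g, (block / P) * k + g % k);
        }
        return 0;
    }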

  17. Hardware-Oblivious Parallelism for In-Memory Column-Stores

    NARCIS (Netherlands)

    M. Heimel; M. Saecker; H. Pirk (Holger); S. Manegold (Stefan); V. Markl

    2013-01-01

    The multi-core architectures of today's computer systems make parallelism a necessity for performance-critical applications. Writing such applications in a generic, hardware-oblivious manner is a challenging problem: current database systems thus rely on labor-intensive and error-prone…

  18. An Improved Algorithm for Parallel Sparse LU Decomposition on a Distributed-Memory Multiprocessor

    NARCIS (Netherlands)

    Koster, J.; Bisseling, R.H.

    2001-01-01

    In this paper we present a new parallel algorithm for the LU decomposition of a general sparse matrix. Among its features are matrix redistribution at regular intervals and a dynamic pivot search strategy that adapts itself to the number of pivots produced. Experimental results obtained on a network…

  19. Architecture-Adaptive Computing Environment: A Tool for Teaching Parallel Programming

    Science.gov (United States)

    Dorband, John E.; Aburdene, Maurice F.

    2002-01-01

    Recently, networked and cluster computation have become very popular. This paper is an introduction to a new C based parallel language for architecture-adaptive programming, aCe C. The primary purpose of aCe (Architecture-adaptive Computing Environment) is to encourage programmers to implement applications on parallel architectures by providing them the assurance that future architectures will be able to run their applications with a minimum of modification. A secondary purpose is to encourage computer architects to develop new types of architectures by providing an easily implemented software development environment and a library of test applications. This new language should be an ideal tool to teach parallel programming. In this paper, we will focus on some fundamental features of aCe C.

  20. Working memory intervention programs for adults: A systematic review

    Directory of Open Access Journals (Sweden)

    Tânia Maria Netto

    Full Text Available This systematic review aimed to identify the designs, procedures, and results of empirical studies that performed neuropsychological interventions on WM in adults. Methods: A PubMed and LILACS literature search was conducted using the keywords working memory AND (training OR rehabilitation OR intervention) AND adult. Results: Of the seven studies found, three were randomized controlled trials, two were case reports, one was a clinical trial, and one was an evaluation study. With regard to the type of programs and samples, three studies employed global programs with healthy elderly adults and four employed specific programs for samples of neurologically impaired adults. Conclusions: The effectiveness of the WM intervention programs was more evident in studies that employed specific methods of rehabilitation for samples with neurological disorders than in those based on global programs with healthy adults. There is a need for more empirical studies to verify the effectiveness of WM intervention programs in order to provide adequate guidance for clinical neuropsychologists and future research.

  1. Effects of Childhood Gymnastics Program on Spatial Working Memory.

    Science.gov (United States)

    Hsieh, Shu-Shih; Lin, Chih-Chien; Chang, Yu-Kai; Huang, Chun-Ju; Hung, Tsung-Min

    2017-08-07

    A growing body of evidence has demonstrated the positive effects of physical exercise on cognition in children, and recent studies have specifically investigated the cognitive benefits of exercises involving cognitive-motor interactions, such as gymnastics. This study examined the effect of 8 weeks of gymnastics training on behavioral and neurophysiological measures of spatial working memory in children. Forty-four children aged 7 to 10 yrs were recruited. The experimental group (n = 24; age: 8.7 ± 1.1 yrs) was recruited from Yilan County in Taiwan, while the control group (n = 20; age: 8.6 ± 1.1 yrs) resided in Taipei City. The experimental group undertook 8 weeks of after-school gymnastics training (2 sessions/week, 90 minutes/session), while the control group received no intervention and was instructed to maintain their routine daily activities. Working memory was assessed by performance on a modified delayed match-to-sample test, and by event-related potentials including the P3 component. Data were collected pre and post treatment for the experimental group, and at the same time interval for the control group. Response accuracy improved following the experimental intervention regardless of working memory demands. Likewise, the P3 amplitude was larger at the parietal site after gymnastics training regardless of the task difficulty. Our results suggest that a short period of gymnastics training had a general facilitative effect on spatial working memory at both a behavioral and neurophysiological level in children. These findings highlight the potential importance of exercise programs involving cognitive-motor interactions in stimulating development of spatial cognition during childhood.

  2. All-pairs Shortest Path Algorithm based on MPI+CUDA Distributed Parallel Programming Model

    Directory of Open Access Journals (Sweden)

    Qingshuang Wu

    2013-12-01

    Full Text Available In view of the problem that computing shortest paths in a graph is a complex and time-consuming process, and that traditional algorithms relying solely on the CPU as the computing unit cannot meet the demand of real-time processing, in this paper we present an all-pairs shortest paths algorithm using the MPI+CUDA hybrid programming model, which can exploit the overwhelming computing power of a GPU cluster to speed up processing. The proposed algorithm combines the advantages of the MPI and CUDA programming models and realizes two-level parallel computing. At the cluster level, we use the MPI programming model to achieve coarse-grained parallel computing between the computational nodes of the GPU cluster. At the node level, we use the CUDA programming model to achieve GPU-accelerated fine-grained parallel computing within each computational node. The experimental results show that the MPI+CUDA-based parallel algorithm can take full advantage of the powerful computing capability of the GPU cluster and can achieve a speedup of hundreds of times; the whole algorithm has good computing performance, reliability and scalability, and is able to meet the demand of real-time processing of massive spatial shortest-path analysis.
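
    For readers unfamiliar with the two-level scheme, here is a minimal sketch of the coarse-grained (cluster) level under stated assumptions; it is not the authors' code. Rows of the distance matrix are block-distributed across MPI ranks; at step k the owner broadcasts row k and every rank relaxes its own rows. In the paper's hybrid model, the inner relaxation loop is what each node would offload to its GPU as a CUDA kernel.

    /* compile with: mpicc fw.c -o fw ; run with a size that divides N */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 8
    #define INF 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int rows = N / size;             /* assume size divides N */
        int first = rank * rows;

        /* toy input: a directed ring graph, each rank holds its rows */
        int (*d)[N] = malloc(sizeof(int[rows][N]));
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++) {
                int g = first + i;
                d[i][j] = (g == j) ? 0 : ((g + 1) % N == j) ? 1 : INF;
            }

        int rowk[N];
        for (int k = 0; k < N; k++) {
            int owner = k / rows;
            if (rank == owner)
                for (int j = 0; j < N; j++) rowk[j] = d[k - first][j];
            MPI_Bcast(rowk, N, MPI_INT, owner, MPI_COMM_WORLD);
            /* in the MPI+CUDA scheme this loop nest becomes a GPU kernel */
            for (int i = 0; i < rows; i++)
                for (int j = 0; j < N; j++)
                    if (d[i][k] + rowk[j] < d[i][j])
                        d[i][j] = d[i][k] + rowk[j];
        }
        if (rank == 0)   /* ring: shortest 0 -> N-1 is N-1 hops */
            printf("d[0][%d] = %d\n", N - 1, d[0][N - 1]);
        free(d);
        MPI_Finalize();
        return 0;
    }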

  3. Convex quadratic programming relaxations for parallel machine scheduling with controllable processing times subject to release times

    Institute of Scientific and Technical Information of China (English)

    ZHANG Feng; CHEN Feng; TANG Guochun

    2004-01-01

    Scheduling unrelated parallel machines with controllable processing times subject to release times is investigated. Based on the convex quadratic programming relaxation and the randomized rounding strategy, a 2-approximation algorithm is obtained for a special case with the all-or-none property, and a 3-approximation algorithm is then presented for the general problem.

  4. Flash Memory Reliability: Read, Program, and Erase Latency Versus Endurance Cycling

    Science.gov (United States)

    Heidecker, Jason

    2010-01-01

    This report documents the efforts and results of the fiscal year (FY) 2010 NASA Electronic Parts and Packaging Program (NEPP) task for nonvolatile memory (NVM) reliability. This year's focus was to measure latency (read, program, and erase) of NAND Flash memories and determine how these parameters drift with erase/program/read endurance cycling.

  6. Nonlinear Memory Capacity of Parallel Time-Delay Reservoir Computers in the Processing of Multidimensional Signals.

    Science.gov (United States)

    Grigoryeva, Lyudmila; Henriques, Julie; Larger, Laurent; Ortega, Juan-Pablo

    2016-07-01

    This letter addresses the reservoir design problem in the context of delay-based reservoir computers for multidimensional input signals, parallel architectures, and real-time multitasking. First, an approximating reservoir model is presented in those frameworks that provides an explicit functional link between the reservoir architecture and its performance in the execution of a specific task. Second, the inference properties of the ridge regression estimator in the multivariate context are used to assess the impact of finite sample training on the decrease of the reservoir capacity. Finally, an empirical study is conducted that shows the adequacy of the theoretical results with the empirical performances exhibited by various reservoir architectures in the execution of several nonlinear tasks with multidimensional inputs. Our results confirm the robustness properties of the parallel reservoir architecture with respect to task misspecification and parameter choice already documented in the literature.

  7. Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

    KAUST Repository

    Quintin, Jean-Noel

    2013-10-01

    Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon's algorithm as it can be used on a nonsquare number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication in the overall execution time more significant. Therefore, the state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene/P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. © 2013 IEEE.

  8. Exploiting Inter-node Pipelining Parallelism in Distributed Memory Systems

    Institute of Scientific and Technical Information of China (English)

    陈莉; 张兆庆; 冯晓兵

    2002-01-01

    Maximizing parallelism and minimizing communication overheads are important issues for distributed memory systems. Communication and data redistribution cannot be avoided even when global optimization of data distribution and computation decomposition is considered. A new approach based on loop fusion is presented to exploit pipelining parallelism, so that communication overhead can be hidden and data redistribution avoided. This technique exploits pipelining from complex loop structures, which distinguishes it from traditional pipelining techniques. Experiments show that the technique is superior to other optimizations.
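
    The effect this loop-fusion pipelining aims for can be illustrated with a generic overlap idiom (an assumption-laden sketch, not the compiler's output): post the boundary communication first with nonblocking MPI, update the interior while the transfer is in flight, and wait only at the end.

    #include <mpi.h>
    #include <stdio.h>

    #define NCOL 512
    #define NROW 64

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        static double a[NROW][NCOL], halo[NCOL];
        MPI_Request sreq = MPI_REQUEST_NULL, rreq = MPI_REQUEST_NULL;

        /* post communication of boundary data first ... */
        if (rank + 1 < size)
            MPI_Isend(a[NROW - 1], NCOL, MPI_DOUBLE, rank + 1, 0,
                      MPI_COMM_WORLD, &sreq);
        if (rank > 0)
            MPI_Irecv(halo, NCOL, MPI_DOUBLE, rank - 1, 0,
                      MPI_COMM_WORLD, &rreq);

        /* ... and overlap it with work on the interior rows */
        for (int i = 1; i < NROW - 1; i++)
            for (int j = 1; j < NCOL - 1; j++)
                a[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] +
                                  a[i][j - 1] + a[i][j + 1]);

        MPI_Wait(&sreq, MPI_STATUS_IGNORE);   /* drain the pipeline */
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        if (rank > 0)                         /* consume received halo */
            for (int j = 0; j < NCOL; j++) a[0][j] = halo[j];

        if (rank == 0) printf("a[1][1] = %f\n", a[1][1]);
        MPI_Finalize();
        return 0;
    }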

  9. Design of a Content Addressable Memory-based Parallel Processor implementing a (−1+j)-based Binary Number System

    Directory of Open Access Journals (Sweden)

    Tariq Jamil

    2014-11-01

    Full Text Available Contrary to the traditional base-2 binary number system used in today's computers, in which a complex number is represented by two separate binary entities, one for the real part and one for the imaginary part, the Complex Binary Number System (CBNS), a binary number system with base (−1+j), represents a given complex number in a single binary string format. In this paper, CBNS is reviewed and arithmetic algorithms for this number system are presented. The design of a CBNS-based parallel processor utilizing content-addressable memory for implementation of the associative dataflow concept is described, and software-related issues are also explained.
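
    To make the single-string representation concrete, the sketch below implements the standard digit-extraction conversion of a Gaussian integer a+bj to base (−1+j) with digits {0,1}; it is an illustrative assumption about the conversion step, not the paper's processor design. The key facts: z is divisible by (−1+j) exactly when a+b is even, so the next digit is (a+b) mod 2, and the quotient (z−d)/(−1+j) works out to ((b−a')/2, (−a'−b)/2) with a' = a−d.

    #include <stdio.h>

    /* write a+bj in base (-1+j), least significant digit computed first */
    static void to_cbns(long a, long b, char *out) {
        char buf[128];
        int n = 0;
        if (a == 0 && b == 0) { out[0] = '0'; out[1] = 0; return; }
        while (a != 0 || b != 0) {
            long d = ((a + b) % 2 + 2) % 2;   /* next binary digit      */
            long a2 = a - d;                  /* now a2+b is even       */
            long na = (b - a2) / 2;           /* (z-d)/(-1+j), real part */
            long nb = (-a2 - b) / 2;          /*               imag part */
            buf[n++] = (char)('0' + d);
            a = na; b = nb;
        }
        for (int i = 0; i < n; i++) out[i] = buf[n - 1 - i]; /* MSB first */
        out[n] = 0;
    }

    int main(void) {
        char s[128];
        to_cbns(-7, 3, s); printf("-7+3j = %s\n", s);  /* 11001010 */
        to_cbns(1, 0, s);  printf(" 1+0j = %s\n", s);  /* 1        */
        to_cbns(0, 1, s);  printf(" 0+1j = %s\n", s);  /* 11, since (-1+j)+1 = j */
        return 0;
    }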

  10. Method, systems, and computer program products for implementing function-parallel network firewall

    Science.gov (United States)

    Fulp, Errin W.; Farley, Ryan J.

    2011-10-11

    Methods, systems, and computer program products for providing function-parallel firewalls are disclosed. According to one aspect, a function-parallel firewall includes a first firewall node for filtering received packets using a first portion of a rule set including a plurality of rules. The first portion includes less than all of the rules in the rule set. At least one second firewall node filters packets using a second portion of the rule set. The second portion includes at least one rule in the rule set that is not present in the first portion. The first and second portions together include all of the rules in the rule set.

  11. Programming Environment for a High-Performance Parallel Supercomputer with Intelligent Communication

    Directory of Open Access Journals (Sweden)

    A. Gunzinger

    1996-01-01

    Full Text Available At the Electronics Laboratory of the Swiss Federal Institute of Technology (ETH) in Zürich, the high-performance parallel supercomputer MUSIC (MUlti processor System with Intelligent Communication) has been developed. As applications like neural network simulation and molecular dynamics show, the Electronics Laboratory supercomputer is absolutely on par with conventional supercomputers, but electric power requirements are reduced by a factor of 1,000, weight is reduced by a factor of 400, and price is reduced by a factor of 100. Software development is a key issue of such parallel systems. This article focuses on the programming environment of the MUSIC system and on its applications.

  12. Teaching Scientific Computing: A Model-Centered Approach to Pipeline and Parallel Programming with C

    Directory of Open Access Journals (Sweden)

    Vladimiras Dolgopolovas

    2015-01-01

    Full Text Available The aim of this study is to present an approach to the introduction into pipeline and parallel computing, using a model of the multiphase queueing system. Pipeline computing, including software pipelines, is among the key concepts in modern computing and electronics engineering. The modern computer science and engineering education requires a comprehensive curriculum, so the introduction to pipeline and parallel computing is the essential topic to be included in the curriculum. At the same time, the topic is among the most motivating tasks due to the comprehensive multidisciplinary and technical requirements. To enhance the educational process, the paper proposes a novel model-centered framework and develops the relevant learning objects. It allows implementing an educational platform of constructivist learning process, thus enabling learners’ experimentation with the provided programming models, obtaining learners’ competences of the modern scientific research and computational thinking, and capturing the relevant technical knowledge. It also provides an integral platform that allows a simultaneous and comparative introduction to pipelining and parallel computing. The programming language C for developing programming models and message passing interface (MPI and OpenMP parallelization tools have been chosen for implementation.

  13. Academic training: From Evolution Theory to Parallel and Distributed Genetic Programming

    CERN Multimedia

    2007-01-01

    2006-2007 ACADEMIC TRAINING PROGRAMME LECTURE SERIES 15, 16 March From 11:00 to 12:00 - Main Auditorium, bldg. 500 From Evolution Theory to Parallel and Distributed Genetic Programming F. FERNANDEZ DE VEGA / Univ. of Extremadura, SP Lecture No. 1: From Evolution Theory to Evolutionary Computation Evolutionary computation is a subfield of artificial intelligence (more particularly computational intelligence) involving combinatorial optimization problems, which are based to some degree on the evolution of biological life in the natural world. In this tutorial we will review the source of inspiration for this metaheuristic and its capability for solving problems. We will show the main flavours within the field, and different problems that have been successfully solved employing this kind of techniques. Lecture No. 2: Parallel and Distributed Genetic Programming The successful application of Genetic Programming (GP, one of the available Evolutionary Algorithms) to optimization problems has encouraged an ...

  14. Parallel programming of exogenous and endogenous components in the antisaccade task.

    Science.gov (United States)

    Massen, Cristina

    2004-04-01

    In the antisaccade task subjects are required to suppress the reflexive tendency to look at a peripherally presented stimulus and to perform a saccade in the opposite direction instead. The present studies aimed at investigating the inhibitory mechanisms responsible for successful performance in this task, testing a hypothesis of parallel programming of exogenous and endogenous components: A reflexive saccade to the stimulus is automatically programmed and competes with the concurrently established voluntary programme to look in the opposite direction. The experiments followed the logic of selectively manipulating the speed of processing of these components and testing the prediction that a selective slowing of the exogenous component should result in a reduced error rate in this task, while a selective slowing of the endogenous component should have the opposite effect. The results provide evidence for the hypothesis of parallel programming and are discussed in the context of alternative accounts of antisaccade performance.

  15. Calculation of illumination conditions at the lunar south pole - parallel programming approach

    Science.gov (United States)

    Figuera, R. Marco; Gläser, P.; Oberst, J.; De Rosa, D.

    2014-04-01

    In this paper we present a parallel programming approach to evaluate illumination conditions at the lunar south pole. Due to the small inclination (1.54°) of the lunar rotational axis with respect to the ecliptic plane and the topography of the lunar south pole, which allows long illumination periods, the study of illumination conditions is of great importance. Several tests were conducted in order to check the viability of the study and to optimize the tool used to calculate such illumination. First results using a simulated case study showed a reduction of the computation time on the order of 8-12 times using parallel programming on the Graphics Processing Unit (GPU) in comparison with sequential programming on the Central Processing Unit (CPU).

  16. Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

    Energy Technology Data Exchange (ETDEWEB)

    Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

    2012-03-01

    String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications, string matching requires a combination of (sometimes all of) the following characteristics: high and/or predictable performance, support for large data sets, and flexibility of integration and customization. Many software-based implementations targeting conventional cache-based microprocessors fail to meet high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software-based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high-performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), on distributed memory with homogeneous processing elements (an InfiniBand cluster of x86 multicores) and with heterogeneous processing elements (an InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.
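    A common way to parallelize multi-pattern matching across cores is to split the input into chunks that overlap by one byte less than the longest pattern, so no match straddling a chunk boundary is lost. The sketch below (an illustration of that decomposition, not code from the paper) shows the idea with OpenMP; a naive matcher stands in for the Aho-Corasick automaton a real implementation would run over each chunk:

        /* Data-parallel multi-pattern search sketch (build: gcc -fopenmp).
           Chunks overlap by maxlen-1 bytes so matches crossing a boundary are
           still found; each match is counted by the chunk in which it *starts*,
           which avoids double counting. */
        #include <omp.h>
        #include <stdio.h>
        #include <string.h>

        static const char *pats[] = { "abc", "bca", "aab" };
        #define NPATS 3

        int main(void) {
            const char *text = "aabcabcaabbcaabcabca";
            size_t n = strlen(text), maxlen = 3;   /* longest pattern length */
            long total = 0;

            #pragma omp parallel for reduction(+:total)
            for (int c = 0; c < 4; c++) {                    /* 4 chunks */
                size_t lo = c * n / 4, hi = (c + 1) * n / 4;
                size_t end = hi + maxlen - 1;                /* overlap region */
                if (end > n) end = n;
                for (size_t i = lo; i < hi; i++)             /* starts in chunk only */
                    for (int p = 0; p < NPATS; p++) {
                        size_t len = strlen(pats[p]);
                        if (i + len <= end && memcmp(text + i, pats[p], len) == 0)
                            total++;
                    }
            }
            printf("matches: %ld\n", total);
            return 0;
        }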

  17. IPULOC - Exploring Dynamic Program Locality with the Instruction Processing Unit for Filling Memory Gap

    Institute of Scientific and Technical Information of China (English)

    黄震春; 李三立

    2002-01-01

    The memory gap has become an essential factor limiting the peak performance of high-speed CPU-based systems. To fill this gap, enlarging cache capacity has been the traditional method, based on the static program locality principle. However, the order of instructions stored in the I-Cache before being sent to the Data Processing Unit (DPU) is a kind of useful information that has not been utilized before. Therefore, an architecture containing an Instruction Processing Unit (IPU) in parallel with the ordinary DPU is proposed. The IPU can prefetch, analyze and preprocess a large number of instructions that would otherwise lie untouched in the I-Cache. It is more efficient than the conventional prefetch buffer, which can only store several instructions for previewing. With the IPU, load instructions can be preprocessed while the DPU is simultaneously executing on data. The architecture is termed "Instruction Processing Unit with LOokahead Cache" (IPULOC for short), in which the idea of dynamic program locality is presented. This paper describes the principle of IPULOC and illustrates the quantitative parameters for evaluation. Tools for simulating IPULOC have been developed. The simulation results show that it can improve program locality during program execution, and hence can improve the cache hit ratio correspondingly, without further enlarging the on-chip cache that occupies a large portion of the chip area.

  18. Algorithmic differentiation of pragma-defined parallel regions: differentiating computer programs containing OpenMP

    CERN Document Server

    Förster, Michael

    2014-01-01

    Numerical programs often use parallel programming techniques such as OpenMP to compute the program's output values as efficiently as possible. In addition, derivative values of these output values with respect to certain input values play a crucial role. To achieve code that computes not only the output values but also the derivative values simultaneously, this work introduces several source-to-source transformation rules. These rules are based on a technique called algorithmic differentiation. The main focus of this work lies on the important reverse mode of algorithmic differentiation. The inh
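    As a hand-worked illustration of what such a source-to-source transformation produces (a sketch of the general reverse-mode idea, not the book's actual rules), the primal OpenMP reduction y = sum of x[i]^2 below is followed by its adjoint loop; the adjoint is itself parallel because each adjoint element is written by exactly one iteration:

        /* Hand-derived reverse-mode differentiation of an OpenMP reduction.
           Primal:  y = sum_i x[i]*x[i]   =>  adjoint:  xb[i] += 2*x[i]*yb.
           Build with: gcc -fopenmp ad.c */
        #include <omp.h>
        #include <stdio.h>
        #define N 1000

        int main(void) {
            double x[N], xb[N] = {0}, y = 0.0, yb = 1.0;   /* seed dy/dy = 1 */
            for (int i = 0; i < N; i++) x[i] = i * 0.001;

            #pragma omp parallel for reduction(+:y)        /* primal sweep  */
            for (int i = 0; i < N; i++)
                y += x[i] * x[i];

            #pragma omp parallel for                       /* adjoint sweep */
            for (int i = 0; i < N; i++)
                xb[i] += 2.0 * x[i] * yb;

            printf("y = %f, dy/dx[500] = %f (expect %f)\n", y, xb[500], 2 * x[500]);
            return 0;
        }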

  19. Synchronizing Parallel Tasks Using STM

    Directory of Open Access Journals (Sweden)

    Ryan Saptarshi Ray

    2015-03-01

    Full Text Available The past few years have marked the start of a historic transition from sequential to parallel computation. The necessity to write parallel programs is increasing as systems are getting more complex while processor speed increases are slowing down. Current parallel programming uses low-level programming constructs like threads and explicit synchronization using locks to coordinate thread execution. Parallel programs written with these constructs are difficult to design, program and debug. Also, locks have many drawbacks which make them a suboptimal solution. One such drawback is that locks should only be used to enclose the critical section of the parallel-processing code. If locks are used to enclose the entire code, then the performance of the code drastically decreases. Software Transactional Memory (STM) is a promising new approach to programming shared-memory parallel processors. It is a concurrency control mechanism that is widely considered to be easier to use by programmers than locking. It allows portions of a program to execute in isolation, without regard to other, concurrently executing tasks. A programmer can reason about the correctness of code within a transaction and need not worry about complex interactions with other, concurrently executing parts of the program. If STM is used to enclose the entire code, then the performance is the same as that of code in which STM encloses only the critical section, and far better than that of code in which locks enclose the entire code. So STM is easier to use than locks, as the critical section does not need to be identified. This paper shows the concept of writing code using Software Transactional Memory (STM) and compares the performance of code using locks with that of code using STM. It also shows why the use of STM in parallel-processing code is better than the use of locks.
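    The contrast is easy to see in miniature. The sketch below protects a shared counter first with a pthread mutex and then with a transaction; the transactional variant uses GCC's experimental transactional-memory extension (compile with gcc -fgnu-tm -pthread) as a stand-in for the STM libraries discussed in the paper:

        /* Critical-section protection: pthread mutex vs. software transactional
           memory (GCC -fgnu-tm syntax; other STM libraries differ). */
        #include <pthread.h>
        #include <stdio.h>

        static long counter = 0;
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        void *add_locked(void *arg) {          /* lock-based: programmer must find */
            for (int i = 0; i < 100000; i++) { /* the critical section and keep it */
                pthread_mutex_lock(&lock);     /* as small as possible             */
                counter++;
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        void *add_transactional(void *arg) {   /* STM: the block runs in isolation; */
            for (int i = 0; i < 100000; i++) { /* conflicts cause transparent retry */
                __transaction_atomic { counter++; }
            }
            return NULL;
        }

        int main(void) {
            pthread_t t[4];
            for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, add_locked, NULL);
            for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
            printf("lock-based counter = %ld\n", counter);

            counter = 0;
            for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, add_transactional, NULL);
            for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
            printf("STM-based  counter = %ld\n", counter);
            return 0;
        }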

  20. The Emory University / Grady Memorial Hospital program: Postponing Sexual Involvement.

    Science.gov (United States)

    1994-05-01

    The Postponing Sexual Involvement program of Emory University's Grady Memorial Hospital began as a teen pregnancy prevention program in the early 1970s. Initially, Atlanta public schools devoted 5 classroom periods to discussion of sexuality and decision making. Evaluation results indicated that adolescent sexual behavior change was not occurring, so program staff added the Postponing Sexual Involvement component in 1983. This abstinence program was based on the theory that social influence, not lack of knowledge, is related to the likelihood of becoming sexually involved. Specific age groups are targeted so that attitudes and skills can be promoted until the maturity for handling sexuality is reached. The assumption is that teenagers are not mature enough to understand the implications of their actions and to deal with the consequences. Adolescents are encouraged to explore feelings about sexual involvement and to envision how their future can be affected by such behavior. Human sexuality information, including contraception, is still provided for 5 periods, with an additional 5 periods on postponing sexual involvement. 4000 8th graders in Atlanta receive this instructional program. A significant feature of the program is the coexistence of the messages that adolescents ought not to get involved sexually at an early age and that, if sexual involvement does occur, they should use appropriate contraception. A unique feature is the teaching conducted by trained 11th and 12th grade students, who serve as teen leaders in presenting information, conducting discussions, teaching assertiveness skills, and providing a forum for practicing how to handle problem situations. The youth models are important for dispelling the myth that "everybody's doing it." Teen leaders received about 20 hours of training in how to guide discussions about handling social and peer pressures. Structured and guided exercises were conducted for practicing skills in resisting peer pressure. The program

  1. Development of parallel mathematical subroutine library and large scale numerical simulation

    Energy Technology Data Exchange (ETDEWEB)

    Shimizu, Futoshi [Japan Atomic Energy Research Inst., Tokyo (Japan)]

    1998-03-01

    In recent years, parallel computers, namely parallel supercomputers and workstation clusters, have come into use for large-scale numerical simulations in the field of computational science. Since parallel programming is difficult compared with programming serial computers, efficient parallel programs can more easily be developed by incorporating parallelized numerical subroutines. At the Japan Atomic Energy Research Institute (JAERI), a portable mathematical subroutine library using MPI (Message Passing Interface) or PVM (Parallel Virtual Machine) is being developed for distributed-memory parallel computers. In this report, we present an overview of the parallel library and its application to the parallelization of tight-binding molecular dynamics. (author)
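    A distributed dot product is the kind of routine such a library would export. The sketch below (an illustrative example, not JAERI's actual interface) computes local partial products and combines them with MPI_Allreduce so every rank receives the global result:

        /* Library-style distributed dot product over MPI: each rank holds a
           block of both vectors. Build: mpicc ddot.c ; run: mpirun -np 4 ./ddot */
        #include <mpi.h>
        #include <stdio.h>

        double par_ddot(const double *x, const double *y, int nlocal, MPI_Comm comm) {
            double local = 0.0, global = 0.0;
            for (int i = 0; i < nlocal; i++)
                local += x[i] * y[i];
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
            return global;
        }

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            double x[4], y[4];                      /* 4 elements per rank */
            for (int i = 0; i < 4; i++) { x[i] = rank * 4 + i; y[i] = 1.0; }
            double s = par_ddot(x, y, 4, MPI_COMM_WORLD);
            if (rank == 0) printf("global dot = %f\n", s);
            MPI_Finalize();
            return 0;
        }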

  2. Fixed-dimensional parallel linear programming via relative ε-approximations

    Energy Technology Data Exchange (ETDEWEB)

    Goodrich, M.T.

    1996-12-31

    We show that linear programming in R^d can be solved deterministically in O((log log n)^d) time using linear work in the PRAM model of computation, for any fixed constant d. Our method is developed for the CRCW variant of the PRAM parallel computation model, and can be easily implemented to run in O(log n (log log n)^(d-1)) time using linear work on an EREW PRAM. A key component in these algorithms is a new, efficient parallel method for constructing ε-nets and ε-approximations (which have wide applicability in computational geometry). In addition, we introduce a new deterministic set approximation for range spaces with finite VC-exponent, which we call the δ-relative ε-approximation, and we show how such approximations can be efficiently constructed in parallel.

  3. Run-Time and Compiler Support for Programming in Adaptive Parallel Environments

    Directory of Open Access Journals (Sweden)

    Guy Edjlali

    1997-01-01

    Full Text Available For better utilization of computing resources, it is important to consider parallel programming environments in which the number of available processors varies at run-time. In this article, we discuss run-time support for data-parallel programming in such an adaptive environment. Executing programs in an adaptive environment requires redistributing data when the number of processors changes, and also requires determining new loop bounds and communication patterns for the new set of processors. We have developed a run-time library to provide this support. We discuss how the run-time library can be used by compilers of High Performance Fortran (HPF)-like languages to generate code for an adaptive environment. We present performance results for a Navier-Stokes solver and a multigrid template run on a network of workstations and an IBM SP-2. Our experiments show that if the number of processors is not varied frequently, the cost of data redistribution is not significant compared to the time required for the actual computation. Overall, our work establishes the feasibility of compiling HPF for a network of nondedicated workstations, which are likely to be an important resource for parallel programming in the future.

  4. Piezopotential-Programmed Multilevel Nonvolatile Memory As Triggered by Mechanical Stimuli.

    Science.gov (United States)

    Sun, Qijun; Ho, Dong Hae; Choi, Yongsuk; Pan, Caofeng; Kim, Do Hwan; Wang, Zhong Lin; Cho, Jeong Ho

    2016-12-27

    We report the development of a piezopotential-programmed nonvolatile memory array using a combination of ion gel-gated field-effect transistors (FETs) and piezoelectric nanogenerators (NGs). Piezopotentials produced by the NGs under external strain were able to replace the gate voltage inputs associated with the programming/erasing operation of the memory, which reduced power consumption compared with conventional memory devices. Multilevel data storage in the memory device could be achieved by varying the external bending strain applied to the piezoelectric NGs. The resulting devices exhibited good memory performance, including a large programming/erasing current ratio exceeding 10^3, multilevel data storage of 2 bits (over 4 levels), performance stability over 100 cycles, and stable data retention over 3000 s. The piezopotential-programmed multilevel nonvolatile memory device described here is important for applications in data-storable electronic skin and advanced human-robot interface operations.

  5. Design and Development of Parallel Programs on Bulk Synchronous Parallel Model

    Institute of Scientific and Technical Information of China (English)

    赖树华; 陆朝俊; 孙永强

    2001-01-01

    The Bulk Synchronous Parallel (BSP) model is briefly introduced, and the advantages of designing and developing parallel programs on the BSP model are discussed. The paper then analyses how to design and develop parallel programs on the BSP model and summarizes several principles the developer must follow. Finally, a practical parallel programming method based on the BSP model is presented: the two-phase method of BSP parallel program design. The multiplication of two matrices is used as an example to illustrate how to apply this method, together with the BSP performance prediction tool, to design and develop BSP parallel programs.
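    A BSP program advances in supersteps: every process computes on local data, exchanges messages in bulk, and then synchronizes at a barrier, with the standard BSP cost model charging w + g·h + l per superstep. The fragment below emulates one superstep with MPI so it stays self-contained (real BSP codes would use a BSPlib-style interface such as bsp_put/bsp_sync):

        /* BSP superstep skeleton emulated with MPI: local compute, then bulk
           exchange, then barrier synchronization. Assumes at most 64 ranks. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int mine = rank + 1, all[64];

            /* superstep 1: computation phase */
            mine = mine * mine;

            /* communication phase + synchronization (the bsp_sync point) */
            MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);

            /* superstep 2 may now read every value written in superstep 1 */
            if (rank == 0)
                for (int p = 0; p < size; p++)
                    printf("value from rank %d: %d\n", p, all[p]);

            MPI_Finalize();
            return 0;
        }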

  6. Chaining direct memory access data transfer operations for compute nodes in a parallel computer

    Science.gov (United States)

    Archer, Charles J.; Blocksome, Michael A.

    2010-09-28

    Methods, systems, and products are disclosed for chaining DMA data transfer operations for compute nodes in a parallel computer that include: receiving, by an origin DMA engine on an origin node in an origin injection FIFO buffer for the origin DMA engine, an RGET data descriptor specifying a DMA transfer operation data descriptor on the origin node and a second RGET data descriptor on the origin node, the second RGET data descriptor specifying a target RGET data descriptor on the target node, the target RGET data descriptor specifying an additional DMA transfer operation data descriptor on the origin node; creating, by the origin DMA engine, an RGET packet in dependence upon the RGET data descriptor, the RGET packet containing the DMA transfer operation data descriptor and the second RGET data descriptor; and transferring, by the origin DMA engine to a target DMA engine on the target node, the RGET packet.
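    The patent language becomes easier to follow with a concrete, if entirely hypothetical, data layout. In the sketch below the struct and field names are invented for clarity (they are not IBM's descriptor format); the point is that an RGET descriptor's payload can carry both a data-transfer descriptor and the next RGET, so the chain unrolls without processor-core involvement:

        /* Illustrative, hypothetical layout for chained DMA descriptors. */
        #include <stddef.h>
        #include <stdio.h>

        enum desc_kind { DESC_DATA, DESC_RGET };

        struct descriptor {
            enum desc_kind kind;
            void *payload;                 /* data to move, or descriptors to fetch */
            size_t bytes;
            struct descriptor *next_rget;  /* NULL terminates the chain */
        };

        /* Walk the chain the way an injection FIFO would consume it. */
        static void process(struct descriptor *d) {
            while (d) {
                if (d->kind == DESC_DATA)
                    printf("transfer %zu bytes\n", d->bytes);
                else
                    printf("fetch %zu bytes of descriptors from target\n", d->bytes);
                d = d->next_rget;
            }
        }

        int main(void) {
            char buf1[256], buf2[512];
            struct descriptor more = { DESC_DATA, buf2, sizeof buf2, NULL };
            struct descriptor rget = { DESC_RGET, &more, sizeof more, NULL };
            struct descriptor data = { DESC_DATA, buf1, sizeof buf1, &rget };
            process(&data);
            return 0;
        }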

  7. Self-pacing direct memory access data transfer operations for compute nodes in a parallel computer

    Energy Technology Data Exchange (ETDEWEB)

    Blocksome, Michael A

    2015-02-17

    Methods, apparatus, and products are disclosed for self-pacing DMA data transfer operations for nodes in a parallel computer that include: transferring, by an origin DMA on an origin node, an RTS message to a target node, the RTS message specifying a message on the origin node for transfer to the target node; receiving, in an origin injection FIFO for the origin DMA from a target DMA on the target node in response to transferring the RTS message, a target RGET descriptor followed by a DMA transfer operation descriptor, the DMA descriptor for transmitting a message portion to the target node, the target RGET descriptor specifying an origin RGET descriptor on the origin node that specifies an additional DMA descriptor for transmitting an additional message portion to the target node; processing, by the origin DMA, the target RGET descriptor; and processing, by the origin DMA, the DMA transfer operation descriptor.

  8. The (parallel) approximability of non-boolean satisfiability problems and restricted integer programming

    Science.gov (United States)

    Serna, Maria; Trevisan, Luca; Xhafa, Fatos

    We present parallel approximation algorithms for maximization problems expressible by integer linear programs of a restricted syntactic form introduced by Barland et al. [BKT96]. One of our motivations was to show whether the approximation results in the framework of Barland et al. hold in the parallel setting. Our results are a confirmation of this, and thus we have a new common framework for both computational settings. Also, we prove almost tight non-approximability results, thus solving a main open question of Barland et al. We obtain the results through the constraint satisfaction problem over multi-valued domains, for which we show non-approximability results and develop parallel approximation algorithms. Our parallel approximation algorithms are based on linear programming and random rounding; they are better than previously known sequential algorithms. The non-approximability results are based on recent progress in the fields of Probabilistically Checkable Proofs and Multi-Prover One-Round Proof Systems [Raz95, Hås97, AS97, RS97].

  9. VISUAL-ORIENTED PARALLEL PROGRAMMING BASED ON STM

    Institute of Scientific and Technical Information of China (English)

    王力生; 黄鹏

    2012-01-01

    The encoding process of parallel programming is quite complicated because interprocess synchronization must be taken into account. Visual parallel programming provides programmers with graphical programming templates and skeletons for designing parallel programs, which reduces the difficulty of parallel program design to some extent. This paper first studies the software transactional memory (STM) model, which, compared with conventional parallel programming approaches, has advantages such as a simple and flexible interface and strong scalability. The STM model is then applied to visual programming, with its programming interfaces provided to programmers in the form of UML activity diagrams, so that the approach no longer relies on a specific software or hardware environment. This improves the universality and scalability of visual parallel programming.

  10. A short executive function training program improves preschoolers’ working memory

    Directory of Open Access Journals (Sweden)

    Emma eBlakey

    2015-11-01

    Full Text Available Cognitive training has been shown to improve executive functions in middle childhood and adulthood. However, fewer studies have targeted the preschool years – a time when executive functions undergo rapid development. The present study tested the effects of a short, four-session executive function training program in 54 four-year-olds. The training group significantly improved their working memory from pre-training levels relative to an active control group. Notably, this effect extended to a task sharing few surface features with the trained tasks, and was still apparent three months later. In addition, the benefits of training extended to a measure of mathematical reasoning three months later, indicating that training executive functions during the preschool years has the potential to convey benefits that are both long-lasting and wide-ranging.

  11. Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    Energy Technology Data Exchange (ETDEWEB)

    Barbara Chapman

    2012-02-01

    OpenMP was not well recognized at the beginning of the project, around 2003, because of its limited use in DoE production applications and the immature hardware support for efficient implementation. Yet in recent years it has gradually been adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked diligently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standards organization to make sure OpenMP evolves in a direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed-memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.

  12. The Feasibility of Using OpenCL Instead of OpenMP for Parallel CPU Programming

    OpenAIRE

    Karimi, Kamran

    2015-01-01

    OpenCL, along with CUDA, is one of the main tools used to program GPGPUs. However, it allows running the same code on multi-core CPUs too, making it a rival for the long-established OpenMP. In this paper we compare OpenCL and OpenMP when developing and running compute-heavy code on a CPU. Both ease of programming and performance aspects are considered. Since, unlike a GPU, no memory copy operation is involved, our comparisons measure the code generation quality, as well as thread management e...

  13. From functional programming to multicore parallelism: A case study based on Presburger Arithmetic

    DEFF Research Database (Denmark)

    Dung, Phan Anh; Hansen, Michael Reichhardt

    2011-01-01

    The overall goal of this work is studying parallelization of functional programs with the specific case study of decision procedures for Presburger Arithmetic (PA). PA is a first order theory of integers accepting addition as its only operation. Whereas it has wide applications in different areas......, we are interested in using PA in connection with the Duration Calculus Model Checker (DCMC) [5]. There are effective decision procedures for PA including Cooper’s algorithm and the Omega Test; however, their complexity is extremely high with doubly exponential lower bound and triply exponential upper...... in the SMT-solver Z3 [8] which has the capability of solving Presburger formulas. Functional programming is well-suited for the domain of decision procedures, and its immutability feature helps to reduce parallelization effort. While Haskell has progressed with a lot of parallelism-related research [6], we

  14. Class Notes: Programming Parallel Algorithms CS 15-840B (Fall 1992)

    Science.gov (United States)

    1993-02-01

    [Only fragments of these scanned course notes survive extraction.] Lecture 15 of CS 15-840B, Programming Parallel Algorithms (Fall 1992), scribed by Bob Wheeler, covers connected components (continued) and minimum spanning trees. The surviving fragments also list student projects (singular value decomposition, EEG analysis, speech recognition, matrix operations) and a reference on object recognition using the Connection Machine (Tucker, Feynman, and Fritzsche, Proceedings CVPR).

  15. The parallel impact of episodic memory and episodic future thinking on food intake.

    Science.gov (United States)

    Vartanian, Lenny R; Chen, William H; Reily, Natalie M; Castel, Alan D

    2016-06-01

    This research examined the effects of both episodic memory and episodic future thinking (EFT) on snack food intake. In Study 1, female participants (n = 158) were asked to recall their lunch from earlier in the day, to think about the dinner they planned to have later in the day, or to think about a non-food activity before taking part in a cookie taste test. Participants who recalled their lunch or who thought about their dinner ate less than did participants who thought about non-food activities. These effects were not explained by group differences in the hedonic value of the food. Study 2 examined whether the suppression effect observed in Study 1 was driven by a general health consciousness. Female participants (n = 74) were asked to think about their past or future exercise (or a non-exercise activity), but thinking about exercise had no impact on participants' cookie consumption. Overall, both thinking about past food intake and imagining future food intake had the same suppression effect on participants' current food intake, but further research is needed to determine the underlying mechanism.

  16. The memory fitness program: cognitive effects of a healthy aging intervention.

    Science.gov (United States)

    Miller, Karen J; Siddarth, Prabha; Gaines, Jean M; Parrish, John M; Ercoli, Linda M; Marx, Katherine; Ronch, Judah; Pilgram, Barbara; Burke, Kasey; Barczak, Nancy; Babcock, Bridget; Small, Gary W

    2012-06-01

    Age-related memory decline affects a large proportion of older adults. Cognitive training, physical exercise, and other lifestyle habits may help to minimize self-perception of memory loss and a decline in objective memory performance. The purpose of this study was to determine whether a 6-week educational program on memory training, physical activity, stress reduction, and healthy diet led to improved memory performance in older adults. A convenience sample of 115 participants (mean age: 80.9 [SD: 6.0 years]) was recruited from two continuing care retirement communities. The intervention consisted of 60-minute classes held twice weekly with 15-20 participants per class. Testing of both objective and subjective cognitive performance occurred at baseline, preintervention, and postintervention. Objective cognitive measures evaluated changes in five domains: immediate verbal memory, delayed verbal memory, retention of verbal information, memory recognition, and verbal fluency. A standardized metamemory instrument assessed four domains of memory self-awareness: frequency and severity of forgetting, retrospective functioning, and mnemonics use. The intervention program resulted in significant improvements on objective measures of memory, including recognition of word pairs (t(114) = 3.62). These findings indicate that a healthy lifestyle program can improve both encoding and recall of new verbal information, as well as self-perception of memory ability, in older adults residing in continuing care retirement communities.

  17. Dynamic programming in parallel boundary detection with application to ultrasound intima-media segmentation.

    Science.gov (United States)

    Zhou, Yuan; Cheng, Xinyao; Xu, Xiangyang; Song, Enmin

    2013-12-01

    Segmentation of the carotid artery intima-media in longitudinal ultrasound images, for measuring its thickness to predict cardiovascular diseases, can be simplified as detecting two nearly parallel boundaries within a certain distance range, when plaque with irregular shapes is not considered. In this paper, we improve the implementation of two dynamic programming (DP) based approaches to parallel boundary detection, dual dynamic programming (DDP) and piecewise linear dual dynamic programming (PL-DDP). Then, a novel DP-based approach, dual line detection (DLD), which translates the original 2-D curve position to a 4-D parameter space representing two line segments in a local image segment, is proposed to solve the problem while maintaining efficiency and rotation invariance. To apply the DLD to ultrasound intima-media segmentation, it is embedded in a framework that employs an edge map, obtained by multiplying the responses of two edge detectors with different scales, and a coupled snake model that simultaneously deforms the two contours to maintain parallelism. The experimental results on synthetic images and carotid arteries of clinical ultrasound images indicate improved performance of the proposed DLD compared to DDP and PL-DDP, with respect to accuracy and efficiency.
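    The DP formulation is easiest to see for a single boundary: choose one row per column of a cost image so that accumulated cost is minimal, subject to a smoothness constraint between neighboring columns. The sketch below implements that base case (DDP extends the state to a *pair* of rows kept within a distance range, which is how two near-parallel boundaries are found jointly):

        /* Single-boundary dynamic programming on a small cost image:
           one row per column, |row change| <= 1 between adjacent columns. */
        #include <stdio.h>
        #include <float.h>

        #define ROWS 5
        #define COLS 6

        int main(void) {
            double cost[ROWS][COLS] = {
                {9,9,9,9,9,9}, {1,2,9,9,9,9}, {9,1,1,2,9,9},
                {9,9,9,1,1,2}, {9,9,9,9,9,1}
            };
            double acc[ROWS][COLS];
            int back[ROWS][COLS];

            for (int r = 0; r < ROWS; r++) acc[r][0] = cost[r][0];
            for (int c = 1; c < COLS; c++)
                for (int r = 0; r < ROWS; r++) {
                    acc[r][c] = DBL_MAX;
                    for (int dr = -1; dr <= 1; dr++) {   /* smoothness constraint */
                        int pr = r + dr;
                        if (pr < 0 || pr >= ROWS) continue;
                        if (acc[pr][c-1] + cost[r][c] < acc[r][c]) {
                            acc[r][c] = acc[pr][c-1] + cost[r][c];
                            back[r][c] = pr;
                        }
                    }
                }

            int best = 0;                                /* backtrack from last column */
            for (int r = 1; r < ROWS; r++)
                if (acc[r][COLS-1] < acc[best][COLS-1]) best = r;
            int path[COLS]; path[COLS-1] = best;
            for (int c = COLS-1; c > 0; c--) path[c-1] = back[path[c]][c];
            for (int c = 0; c < COLS; c++) printf("col %d -> row %d\n", c, path[c]);
            return 0;
        }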

  18. Full Or-Parallelism and Restricted And-Parallelism in BTM

    Institute of Scientific and Technical Information of China (English)

    郑宇华; 谢立; 等

    1994-01-01

    BTM is a new And/Or parallel execution model for logic programs which exploits both full Or-parallelism and restricted And-parallelism. The advantages of high parallelism and low run-time cost make BTJ, an experimental execution system of BTM implemented on a nonshared-memory multiprocessor system, achieve significant speedup for both And-parallel and Or-parallel logic programs.

  19. Memory

    Science.gov (United States)

    ... it has to decide what is worth remembering. Memory is the process of storing and then remembering this information. There are different types of memory. Short-term memory stores information for a few ...

  20. Shared Variable Oriented Parallel Precompiler for SPMD Model

    Institute of Scientific and Technical Information of China (English)

    1995-01-01

    At present, commercial parallel computer systems with distributed-memory architecture are usually provided with parallel FORTRAN or parallel C compilers, which are just traditional sequential FORTRAN or C compilers extended with communication statements. Programmers suffer from having to write parallel programs with explicit communication statements. The Shared Variable Oriented Parallel Precompiler (SVOPP) proposed in this paper can automatically generate appropriate communication statements based on shared variables for the SPMD (Single Program Multiple Data) computation model, greatly easing parallel programming while achieving high communication efficiency. The core function of the parallel C precompiler has been successfully verified on a transputer-based parallel computer. Its prominent performance shows that SVOPP is probably a breakthrough in parallel programming technique.

  1. Distributed-memory matrix computations

    DEFF Research Database (Denmark)

    Balle, Susanne Mølleskov

    1995-01-01

    in these algorithms is that many scientific applications rely heavily on the performance of the involved dense linear algebra building blocks. Even though we consider the distributed-memory as well as the shared-memory programming paradigm, the major part of the thesis is dedicated to distributed-memory architectures....... We emphasize distributed-memory massively parallel computers - such as the Connection Machines model CM-200 and model CM-5/CM-5E - available to us at UNI-C and at Thinking Machines Corporation. The CM-200 was at the time this project started one of the few existing massively parallel computers...

  2. Compiler Technology for Parallel Scientific Computation

    Directory of Open Access Journals (Sweden)

    Can Özturan

    1994-01-01

    Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.

  3. CaKernel – A Parallel Application Programming Framework for Heterogenous Computing Architectures

    Directory of Open Access Journals (Sweden)

    Marek Blazewicz

    2011-01-01

    Full Text Available With the recent advent of new heterogeneous computing architectures there is still a lack of parallel problem solving environments that can help scientists use hybrid supercomputers easily and efficiently. Many scientific simulations that use structured grids to solve partial differential equations in fact rely on stencil computations. Stencil computations have become crucial in solving many challenging problems in various domains, e.g., engineering or physics. Although many parallel stencil computing approaches have been proposed, in most cases they solve only particular problems. As a result, scientists struggle when it comes to implementing a new stencil-based simulation, especially on high-performance hybrid supercomputers. In response to this need we extend our previous work on CaCUDA, a parallel programming framework for CUDA, which now also supports OpenCL. We present CaKernel – a tool that simplifies the development of parallel scientific applications on hybrid systems. CaKernel is built on the highly scalable and portable Cactus framework. In the CaKernel framework, Cactus manages the inter-process communication via MPI while CaKernel manages the code running on Graphics Processing Units (GPUs) and the interactions between them. As a non-trivial test case we have developed a 3D CFD code to demonstrate the performance and scalability of the automatically generated code.

  4. High performance parallel computers for science: New developments at the Fermilab advanced computer program

    Energy Technology Data Exchange (ETDEWEB)

    Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

    1988-08-01

    Fermilab's Advanced Computer Program (ACP) has been developing highly cost-effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors, each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN- or C-programmable pipelined 20 MFlops (peak), 10 MByte single-board computer. These are plugged into a 16-port crossbar switch crate which handles both inter- and intra-crate communication. The crates are connected in a hypercube. Site-oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256-node, 5 GFlop system is under construction. 10 refs., 7 figs.

  5. The FORCE: A portable parallel programming language supporting computational structural mechanics

    Science.gov (United States)

    Jordan, Harry F.; Benten, Muhammad S.; Brehm, Juergen; Ramanan, Aruna

    1989-01-01

    This project supports the conversion of codes in Computational Structural Mechanics (CSM) to a parallel form which will efficiently exploit the computational power available from multiprocessors. The work is part of a comprehensive, FORTRAN-based system to form a basis for a parallel version of the NICE/SPAR combination which will form the CSM Testbed. The software is macro-based and rests on the Force methodology developed by the principal investigator in connection with an early scientific multiprocessor. Machine independence is an important characteristic of the system, so that retargeting it to the Flex/32, or any other multiprocessor on which NICE/SPAR might be implemented, is well supported. The principal investigator has experience in producing parallel software for both full and sparse systems of linear equations using the Force macros. Other researchers have used the Force in finite element programs. It has been possible to rapidly develop software which performs at maximum efficiency on a multiprocessor. The inherent machine independence of the system also means that the parallelization will not be limited to a specific multiprocessor.

  6. Translation techniques for distributed-shared memory programming models

    Energy Technology Data Exchange (ETDEWEB)

    Fuller, Douglas James [Iowa State Univ., Ames, IA (United States)

    2005-01-01

    The high performance computing community has experienced an explosive improvement in distributed-shared memory hardware. Driven by increasing real-world problem complexity, this explosion has ushered in vast numbers of new systems. Each new system presents new challenges to programmers and application developers. Part of the challenge is adapting to new architectures with new performance characteristics. Different vendors release systems with widely varying architectures that perform differently in different situations. Furthermore, since vendors need only provide a single performance number (total MFLOPS, typically for a single benchmark), they only have strong incentive initially to optimize the API of their choice. Consequently, only a fraction of the available APIs are well optimized on most systems. This causes issues porting and writing maintainable software, let alone issues for programmers burdened with mastering each new API as it is released. Also, programmers wishing to use a certain machine must choose their API based on the underlying hardware instead of the application. This thesis argues that a flexible, extensible translator for distributed-shared memory APIs can help address some of these issues. For example, a translator might take as input code in one API and output an equivalent program in another. Such a translator could provide instant porting for applications to new systems that do not support the application's library or language natively. While open-source APIs are abundant, they do not perform optimally everywhere. A translator would also allow performance testing using a single base code translated to a number of different APIs. Most significantly, this type of translator frees programmers to select the most appropriate API for a given application based on the application (and developer) itself instead of the underlying hardware.

  7. Translation techniques for distributed-shared memory programming models

    Energy Technology Data Exchange (ETDEWEB)

    Fuller, Douglas James

    2005-08-01

    The high performance computing community has experienced an explosive improvement in distributed-shared memory hardware. Driven by increasing real-world problem complexity, this explosion has ushered in vast numbers of new systems. Each new system presents new challenges to programmers and application developers. Part of the challenge is adapting to new architectures with new performance characteristics. Different vendors release systems with widely varying architectures that perform differently in different situations. Furthermore, since vendors need only provide a single performance number (total MFLOPS, typically for a single benchmark), they only have strong incentive initially to optimize the API of their choice. Consequently, only a fraction of the available APIs are well optimized on most systems. This causes issues porting and writing maintainable software, let alone issues for programmers burdened with mastering each new API as it is released. Also, programmers wishing to use a certain machine must choose their API based on the underlying hardware instead of the application. This thesis argues that a flexible, extensible translator for distributed-shared memory APIs can help address some of these issues. For example, a translator might take as input code in one API and output an equivalent program in another. Such a translator could provide instant porting for applications to new systems that do not support the application's library or language natively. While open-source APIs are abundant, they do not perform optimally everywhere. A translator would also allow performance testing using a single base code translated to a number of different APIs. Most significantly, this type of translator frees programmers to select the most appropriate API for a given application based on the application (and developer) itself instead of the underlying hardware.

  8. Genome-wide Functional Analysis of CREB/Long-Term Memory-Dependent Transcription Reveals Distinct Basal and Memory Gene Expression Programs

    Science.gov (United States)

    Lakhina, Vanisha; Arey, Rachel N.; Kaletsky, Rachel; Kauffman, Amanda; Stein, Geneva; Keyes, William; Xu, Daniel; Murphy, Coleen T.

    2014-01-01

    Induced CREB activity is a hallmark of long-term memory, but the full repertoire of CREB transcriptional targets required specifically for memory is not known in any system. To obtain a more complete picture of the mechanisms involved in memory, we combined memory training with genome-wide transcriptional analysis of C. elegans CREB mutants. This approach identified 757 significant CREB/memory-induced targets and confirmed the involvement of known memory genes from other organisms, but also suggested new mechanisms and novel components that may be conserved through mammals. CREB mediates distinct basal and memory transcriptional programs at least partially through spatial restriction of CREB activity: basal targets are regulated primarily in nonneuronal tissues, while memory targets are enriched for neuronal expression, emanating from CREB activity in AIM neurons. This suite of novel memory-associated genes will provide a platform for the discovery of orthologous mammalian long-term memory components. PMID:25611510

  9. Genome-wide functional analysis of CREB/long-term memory-dependent transcription reveals distinct basal and memory gene expression programs.

    Science.gov (United States)

    Lakhina, Vanisha; Arey, Rachel N; Kaletsky, Rachel; Kauffman, Amanda; Stein, Geneva; Keyes, William; Xu, Daniel; Murphy, Coleen T

    2015-01-21

    Induced CREB activity is a hallmark of long-term memory, but the full repertoire of CREB transcriptional targets required specifically for memory is not known in any system. To obtain a more complete picture of the mechanisms involved in memory, we combined memory training with genome-wide transcriptional analysis of C. elegans CREB mutants. This approach identified 757 significant CREB/memory-induced targets and confirmed the involvement of known memory genes from other organisms, but also suggested new mechanisms and novel components that may be conserved through mammals. CREB mediates distinct basal and memory transcriptional programs at least partially through spatial restriction of CREB activity: basal targets are regulated primarily in nonneuronal tissues, while memory targets are enriched for neuronal expression, emanating from CREB activity in AIM neurons. This suite of novel memory-associated genes will provide a platform for the discovery of orthologous mammalian long-term memory components.

  10. The neural basis of parallel saccade programming: an fMRI study.

    Science.gov (United States)

    Hu, Yanbo; Walker, Robin

    2011-11-01

    The neural basis of parallel saccade programming was examined in an event-related fMRI study using a variation of the double-step saccade paradigm. Two double-step conditions were used: one enabled the second saccade to be partially programmed in parallel with the first saccade while in a second condition both saccades had to be prepared serially. The intersaccadic interval, observed in the parallel programming (PP) condition, was significantly reduced compared with latency in the serial programming (SP) condition and also to the latency of single saccades in control conditions. The fMRI analysis revealed greater activity (BOLD response) in the frontal and parietal eye fields for the PP condition compared with the SP double-step condition and when compared with the single-saccade control conditions. By contrast, activity in the supplementary eye fields was greater for the double-step condition than the single-step condition but did not distinguish between the PP and SP requirements. The role of the frontal eye fields in PP may be related to the advanced temporal preparation and increased salience of the second saccade goal that may mediate activity in other downstream structures, such as the superior colliculus. The parietal lobes may be involved in the preparation for spatial remapping, which is required in double-step conditions. The supplementary eye fields appear to have a more general role in planning saccade sequences that may be related to error monitoring and the control over the execution of the correct sequence of responses.

  11. Amnesic H.M.'s performance on the language competence test: parallel deficits in memory and sentence production.

    Science.gov (United States)

    MacKay, Donald G; James, Lori E; Hadley, Christopher B

    2008-04-01

    To test conflicting hypotheses regarding amnesic H.M.'s language abilities, this study examined H.M.'s sentence production on the Language Competence Test (Wiig & Secord, 1988). The task for H.M. and 8 education-, age-, and IQ-matched controls was to describe pictures using a single grammatical sentence containing prespecified target words. The results indicated selective deficits in H.M.'s picture descriptions: H.M. produced fewer single grammatical sentences, included fewer target words, and described the pictures less completely and accurately than did the controls. However, H.M.'s deficits diminished with repeated processing of unfamiliar stimuli and disappeared for familiar stimuli-effects that help explain why other researchers have concluded that H.M.'s language production is intact. Besides resolving the conflicting hypotheses, present results replicated other well-controlled sentence production results and indicated that H.M.'s language and memory exhibit parallel deficits and sparing. Present results comport in detail with binding theory but pose problems for current systems theories of H.M.'s condition.

  12. Parallel Profiles of Inflammatory and Effector Memory T Cells in Visceral Fat and Liver of Obesity-Associated Cancer Patients.

    Science.gov (United States)

    Conroy, Melissa J; Galvin, Karen C; Doyle, Suzanne L; Kavanagh, Maria E; Mongan, Ann-Marie; Cannon, Aoife; Moore, Gillian Y; Reynolds, John V; Lysaght, Joanne

    2016-10-01

    In the midst of a worsening obesity epidemic, the incidence of obesity-associated morbidities, including cancer, diabetes, cardiac and liver disease is increasing. Insights into mechanisms underlying pathological obesity-associated inflammation are lacking. Both the omentum, the principal component of visceral fat, and liver of obese individuals are sites of excessive inflammation, but to date the T cell profiles of both compartments have not been assessed or compared in a patient cohort with obesity-associated disease. We have previously identified that omentum is enriched with inflammatory cytokines, chemokines and T cells. Here, we compared the inflammatory profile of T cells in the omentum and liver of patients with the obesity-associated malignancy oesophageal adenocarcinoma (OAC). Furthermore, we assessed the secreted cytokine profile in OAC patient serum, omentum and liver to assess systemic and local inflammation. We observed parallel T cell cytokine profiles and phenotypes in the omentum and liver of OAC patients, in particular CD69(+) and inflammatory effector memory T cells. This study reflects similar processes of inflammation and T cell activation in the omentum and liver, and may suggest common targets to modulate pathological inflammation at these sites.

  13. The Parallel C Preprocessor

    Directory of Open Access Journals (Sweden)

    Eugene D. Brooks III

    1992-01-01

    Full Text Available We describe a parallel extension of the C programming language designed for multiprocessors that provide a facility for sharing memory between processors. The programming model was initially developed on conventional shared memory machines with small processor counts such as the Sequent Balance and Alliant FX/8, but has more recently been used on a scalable massively parallel machine, the BBN TC2000. The programming model is split-join rather than fork-join. Concurrency is exploited to use a fixed number of processors more efficiently rather than to exploit more processors as in the fork-join model. Team splitting, a mechanism to split the team of processors executing a code into subteams to handle parallel subtasks, is used to provide an efficient mechanism to exploit nested concurrency. We have found the split-join programming model to have an inherent implementation advantage, compared to the fork-join model, when the number of processors in a machine becomes large.
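    The following pthread sketch (illustrative only, not PCP syntax) mimics the split-join shape the abstract describes: one fixed team is created at startup, team splitting is done by partitioning member ranks between subtasks, and the single join happens at program end rather than at every parallel region:

        /* Split-join sketch with a fixed pthread team. Build: gcc -pthread sj.c */
        #include <pthread.h>
        #include <stdio.h>

        #define TEAM 4

        static void subtask(const char *name, int subrank, int subsize) {
            printf("subteam %s: member %d of %d\n", name, subrank, subsize);
        }

        static void *member(void *arg) {
            int rank = (int)(long)arg;
            /* team splitting: ranks 0..1 form subteam A, ranks 2..3 subteam B */
            if (rank < TEAM / 2) subtask("A", rank, TEAM / 2);
            else                 subtask("B", rank - TEAM / 2, TEAM - TEAM / 2);
            return NULL;
        }

        int main(void) {
            pthread_t t[TEAM];
            for (long r = 0; r < TEAM; r++)        /* the one and only "split" */
                pthread_create(&t[r], NULL, member, (void *)r);
            for (int r = 0; r < TEAM; r++)         /* the one and only "join"  */
                pthread_join(t[r], NULL);
            return 0;
        }

    In the fork-join model, by contrast, threads would be created and destroyed around every parallel region, which is the overhead split-join avoids on large machines.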

  14. Memory-Optimized Software Synthesis from Dataflow Program Graphs with Large Size Data Samples

    Directory of Open Access Journals (Sweden)

    Hyunok Oh

    2003-05-01

    Full Text Available In multimedia and graphics applications, data samples of nonprimitive type require a significant amount of buffer memory. This paper addresses the problem of minimizing the buffer memory requirement for such applications in embedded software synthesis from graphical dataflow programs based on the synchronous dataflow (SDF) model, with the execution order of nodes given. We propose a memory minimization technique that separates global memory buffers from local pointer buffers: the global buffers store live data samples and the local buffers store pointers to the global buffer entries. The proposed algorithm reduces memory by 67% for a JPEG encoder and by 40% for an H.263 encoder compared with unshared versions, and by 22% compared with the previous sharing algorithm for the H.263 encoder. Through extensive buffer sharing optimization, we believe that automatic software synthesis from dataflow program graphs achieves code quality comparable to manually optimized code in terms of memory requirements.
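    The global-buffer/pointer-buffer split is simple to picture: large frames live once in a shared pool, and the arcs of the dataflow graph queue only pointers to them. The sketch below (a minimal single-threaded illustration with invented names, not the paper's algorithm) forwards a 64 KB frame from producer to consumer by copying eight bytes instead of the frame:

        /* Global sample pool + local pointer queue for one SDF arc. */
        #include <stdio.h>
        #include <string.h>

        #define FRAME_BYTES (64 * 1024)
        #define POOL_FRAMES 4            /* max simultaneously live samples */
        #define QLEN 8

        static unsigned char pool[POOL_FRAMES][FRAME_BYTES];  /* global buffer  */
        static unsigned char *arc[QLEN];                      /* pointer buffer */
        static int head = 0, tail = 0;

        static void push(unsigned char *frame) { arc[tail++ % QLEN] = frame; }
        static unsigned char *pop(void)        { return arc[head++ % QLEN]; }

        int main(void) {
            /* producer: claim a pool frame, fill it, enqueue only the pointer */
            unsigned char *f = pool[0];
            memset(f, 0xAB, FRAME_BYTES);
            push(f);

            /* consumer: dequeue the pointer and read the sample in place */
            unsigned char *g = pop();
            printf("first byte: 0x%02X (no %d-byte copy happened)\n", g[0], FRAME_BYTES);
            return 0;
        }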

  15. VHDL-based programming environment for Floating-Gate analog memory cell

    Directory of Open Access Journals (Sweden)

    Carlos Alberto dos Reis Filho

    2005-02-01

    Full Text Available An implementation in CMOS technology of a Floating-Gate Analog Memory Cell and Programming Environment is presented. A digital closed-loop control compares a reference value set by user and the memory output and after cycling, the memory output is updated and the new value stored. The circuit can be used as analog trimming for VLSI applications where mechanical trimming associated with postprocessing chip is prohibitive due to high costs.

  16. Facing competition: Neural mechanisms underlying parallel programming of antisaccades and prosaccades.

    Science.gov (United States)

    Talanow, Tobias; Kasparbauer, Anna-Maria; Steffens, Maria; Meyhöfer, Inga; Weber, Bernd; Smyrnis, Nikolaos; Ettinger, Ulrich

    2016-08-01

    The antisaccade task is a prominent tool to investigate the response inhibition component of cognitive control. Recent theoretical accounts explain performance in terms of parallel programming of exogenous and endogenous saccades, linked to the horse race metaphor. Previous studies have tested the hypothesis of competing saccade signals at the behavioral level by selectively slowing the programming of endogenous or exogenous processes, e.g. by manipulating the probability of antisaccades in an experimental block. To gain a better understanding of inhibitory control processes in parallel saccade programming, we analyzed task-related eye movements and blood oxygenation level dependent (BOLD) responses obtained using functional magnetic resonance imaging (fMRI) at 3T from 16 healthy participants in a mixed antisaccade and prosaccade task. The frequency of antisaccade trials was manipulated across blocks of high (75%) and low (25%) antisaccade frequency. In blocks with high antisaccade frequency, antisaccade latencies were shorter and error rates lower whilst prosaccade latencies were longer and error rates were higher. At the level of BOLD, activations in the task-related saccade network (left inferior parietal lobe, right inferior parietal sulcus, left precentral gyrus reaching into left middle frontal gyrus and inferior frontal junction) and deactivations in components of the default mode network (bilateral temporal cortex, ventromedial prefrontal cortex) compensated for increased cognitive control demands. These findings illustrate context dependent mechanisms underlying the coordination of competing decision signals in volitional gaze control.

  17. Eighth SIAM conference on parallel processing for scientific computing: Final program and abstracts

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-12-31

    This SIAM conference is the premier forum for developments in parallel numerical algorithms, a field that has seen very lively and fruitful developments over the past decade, and whose health is still robust. Themes for this conference were: combinatorial optimization; data-parallel languages; large-scale parallel applications; message-passing; molecular modeling; parallel I/O; parallel libraries; parallel software tools; parallel compilers; particle simulations; problem-solving environments; and sparse matrix computations.

  18. Program Suite for Conceptual Designing of Parallel Mechanism-Based Robots and Machine Tools

    Directory of Open Access Journals (Sweden)

    Slobodan Tabaković

    2013-06-01

    This paper describes the categorization of criteria for the conceptual design of parallel mechanism-based robots or machine tools, resulting from workspace analysis, as well as the procedure for defining them. Furthermore, it also presents the design methodology that was implemented in the program for creating a robot or machine tool space model and optimizing the resulting solution. For verification of the criteria and the program suite, three common (conceptually different) mechanisms with a similar mechanical structure and kinematic characteristics were used.

  19. Speeding Up the String Comparison of the IDS Snort using Parallel Programming: A Systematic Literature Review on the Parallelized Aho-Corasick Algorithm

    Directory of Open Access Journals (Sweden)

    SILVA JUNIOR,J. B.

    2016-12-01

    Full Text Available The Intrusion Detection System (IDS) needs to compare the contents of all packets arriving at the network interface with a set of signatures indicating possible attacks, a task that consumes much CPU processing time. In order to alleviate this problem, some researchers have tried to parallelize the IDS's comparison engine, transferring execution from the CPU to the GPU. This paper identifies and maps the parallelization features of the Aho-Corasick algorithm, which is used in Snort to compare patterns, in order to show the algorithm's implementation and execution issues, as well as optimization techniques for the Aho-Corasick machine. We found 147 papers in major computer science publication databases and mapped them; we selected 22 and analyzed them in order to obtain our results. Our analysis showed, among other results, that parallelization of the AC algorithm is a recent undertaking and that authors have focused on the State Transition Table as the most common way to implement the algorithm on the GPU. Furthermore, we found that techniques which speed up the algorithm and reduce the required storage space are widely used, such as running the algorithm in the fastest memories and mechanisms for reducing the number of nodes and bit mapping.
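    A dense State Transition Table is attractive on GPUs because the search loop becomes one table lookup per input byte. The sketch below builds a small Aho-Corasick automaton in that representation (the textbook construction, not code from any surveyed paper): the BFS pass both sets failure links and densifies missing transitions, so searching never follows a failure chain:

        /* Minimal Aho-Corasick automaton as a dense state transition table. */
        #include <stdio.h>

        #define MAXS 64      /* max states */
        #define SIGMA 26     /* alphabet a-z */

        static int next_[MAXS][SIGMA], fail[MAXS], out[MAXS], nstates = 1;

        static void add_pattern(const char *p) {
            int s = 0;
            for (; *p; p++) {
                int c = *p - 'a';
                if (next_[s][c] == 0) next_[s][c] = nstates++;
                s = next_[s][c];
            }
            out[s]++;
        }

        static void build(void) {           /* BFS turns the trie into a DFA */
            int queue[MAXS], qh = 0, qt = 0;
            for (int c = 0; c < SIGMA; c++)
                if (next_[0][c]) { fail[next_[0][c]] = 0; queue[qt++] = next_[0][c]; }
            while (qh < qt) {
                int s = queue[qh++];
                out[s] += out[fail[s]];     /* inherit matches from fail state */
                for (int c = 0; c < SIGMA; c++) {
                    int t = next_[s][c];
                    if (t) { fail[t] = next_[fail[s]][c]; queue[qt++] = t; }
                    else next_[s][c] = next_[fail[s]][c];   /* densify table */
                }
            }
        }

        int main(void) {
            add_pattern("he"); add_pattern("she"); add_pattern("his"); add_pattern("hers");
            build();
            const char *text = "ushers";
            int s = 0; long matches = 0;
            for (int i = 0; text[i]; i++) {
                s = next_[s][text[i] - 'a'];   /* one lookup per input byte */
                matches += out[s];
            }
            printf("matches in \"%s\": %ld\n", text, matches);  /* she, he, hers */
            return 0;
        }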

  20. Adaptive Memory Coherence Algorithms in DSVM

    Institute of Scientific and Technical Information of China (English)

    周建强; 谢立; 等

    1994-01-01

    Based on the characteristics of distributed systems and the behavior of parallel programs, this paper presents fixed and randomized competitive memory coherence algorithms for distributed shared virtual memory. These algorithms exploit parallel programs' locality of reference and exhibit good competitive properties. Our simulation shows that the fixed and randomized algorithms achieve better performance and higher stability than other strategies such as write-invalidate and write-update.

  1. Modularity, Working Memory, and Second Language Acquisition: A Research Program

    Science.gov (United States)

    Truscott, John

    2017-01-01

    Considerable reason exists to view the mind, and language within it, as modular, and this view has an important place in research and theory in second language acquisition (SLA) and beyond. But it has had very little impact on the study of working memory and its role in SLA. This article considers the need for modular study of working memory,…

  2. A program system for ab initio MO calculations on vector and parallel processing machines III. Integral reordering and four-index transformation

    Science.gov (United States)

    Wiest, Roland; Demuynck, Jean; Bénard, Marc; Rohmer, Marie-Madeleine; Ernenwein, René

    1991-01-01

    This series of three papers presents a program system for ab initio molecular orbital calculations on vector and parallel computers. Part III is devoted to the four-index transformation onto a molecular orbital basis of size NMO of the file of two-electron integrals (pq∥rs) generated by a contracted Gaussian set of size NATO (number of atomic orbitals). A fast Yoshimine algorithm first sorts the (pq∥rs) integrals with respect to index pq only. This file of half-sorted integrals, labelled by their rs-index, can be processed without further modification to generate either the transformed integrals or the supermatrix elements. The large memory available on the CRAY-2 has made it possible to implement the transformation algorithm proposed by Bender in 1972, which requires a core-storage allocation varying as (NATO)^3. Two versions of Bender's algorithm are included in the present program. The first is an in-core version, where the complete file of accumulated contributions to transformed integrals is stored and updated in central memory. This version has been parallelized by distributing over a limited number of logical tasks the NATO steps corresponding to the scanning of the outermost loop. The second is an out-of-core version, in which twin files are used alternately as input and output for the accumulated contributions to transformed integrals. This version is not parallel. The choice of one or the other version and (for version 1) the number of tasks depend upon the balance between the available and requested amounts of storage. The storage management and the choice of the proper version are carried out automatically using dynamic storage allocation. Both versions are vectorized and take advantage of the molecular symmetry.
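
    The essence of any four-index transformation is that it is performed as four successive one-index (quarter) transformations, which lowers the cost from O(N^8) to O(N^5). The naive in-core C sketch below shows only this generic loop structure, with an identity coefficient matrix so the demo is checkable; it omits the Yoshimine sort, symmetry handling and parallelization described above:

        /* Naive in-core four-index transformation (pq|rs) -> (ij|kl),
         * done as four quarter transforms. Illustrative sketch only. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N 4
        #define IDX(p,q,r,s) ((((p)*N + (q))*N + (r))*N + (s))

        /* Transform the leading index: B[i,q,r,s] = sum_p C[p][i]*A[p,q,r,s]. */
        static void quarter(const double *A, double *B, double C[N][N]) {
            for (int i = 0; i < N; i++)
              for (int q = 0; q < N; q++)
                for (int r = 0; r < N; r++)
                  for (int s = 0; s < N; s++) {
                      double acc = 0.0;
                      for (int p = 0; p < N; p++)
                          acc += C[p][i] * A[IDX(p,q,r,s)];
                      B[IDX(i,q,r,s)] = acc;
                  }
        }

        /* Cyclically permute indices so the next pass transforms the
         * next original index: B[q,r,s,p] = A[p,q,r,s]. */
        static void rotate(const double *A, double *B) {
            for (int p = 0; p < N; p++)
              for (int q = 0; q < N; q++)
                for (int r = 0; r < N; r++)
                  for (int s = 0; s < N; s++)
                      B[IDX(q,r,s,p)] = A[IDX(p,q,r,s)];
        }

        int main(void) {
            size_t n4 = (size_t)N*N*N*N;
            double *a = malloc(n4 * sizeof *a), *b = malloc(n4 * sizeof *b);
            /* Identity coefficients: the output must equal the input. */
            double C[N][N] = {{1,0,0,0},{0,1,0,0},{0,0,1,0},{0,0,0,1}};
            for (size_t x = 0; x < n4; x++) a[x] = (double)x;
            for (int pass = 0; pass < 4; pass++) { /* four quarter transforms */
                quarter(a, b, C);
                rotate(b, a);                      /* expose the next index */
            }
            printf("transformed (0,1|2,3) = %g\n", a[IDX(0,1,2,3)]);
            free(a); free(b);
            return 0;
        }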

  3. Telephone word-list recall tested in the rural aging and memory study: two parallel versions for the TICS-M.

    Science.gov (United States)

    Hogervorst, Eva; Bandelow, Stephan; Hart, John; Henderson, Victor W

    2004-09-01

    Parallel versions of memory tasks are useful in clinical and research settings to reduce the practice effects engendered by multiple administrations. We aimed to investigate the usefulness of three parallel versions of a ten-item word-list recall task administered by telephone. A population-based telephone survey of middle-aged and elderly residents of Bradley County, Arkansas was carried out as part of the Rural Aging and Memory Study (RAMS). Participants in the study were 1845 persons aged 40 to 95 years. The word lists included the one used in the Telephone Interview of Cognitive Status (TICS), as a criterion standard, and two newly developed lists. The mean age of participants was 61.05 (SD 12.44) years; 39.5% were over age 65, 78% of the participants had completed high school, 66% were women and 21% were African-American. There was no difference in demographic characteristics between groups receiving different word-list versions, and performance on the three versions was equivalent for both immediate (mean 4.22, SD 1.53) and delayed (mean 2.35, SD 1.75) recall trials. The total memory score (immediate + delayed recall) was associated with older age (beta = -0.41, 95% CI -0.11 to -0.04), education (beta = 0.24, 95% CI 0.36 to 0.51), male gender (beta = -0.18, 95% CI -1.39 to -0.90) and African-American race (beta = -0.15, 95% CI -1.41 to -0.82). The two RAMS word-recall lists and the TICS word-recall list can be used interchangeably in telephone assessment of memory in middle-aged and elderly persons. This finding is important for future studies where parallel versions of a word-list memory task are needed.

  4. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven

    2009-01-01

    The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chips. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism, and further extraction of parallelism is limited by small data sets and a relatively high parallelization overhead. Load balance is difficult to obtain due to the limited parallelism and is made worse by non-uniform memory latency. Three parallel OpenMP implementations of the application…
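
    Loop-level parallelization of an image kernel with OpenMP, including a dynamic schedule to counter the kind of load imbalance discussed, might look like the following sketch; the 3x3 box filter and image sizes are invented, not taken from the paper:

        /* Generic OpenMP image-filter sketch: rows are distributed with a
         * dynamic schedule so rows of uneven cost cause less imbalance. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        #define W 1024
        #define H 768

        int main(void) {
            unsigned char *in  = malloc(W * H);
            unsigned char *out = calloc(W, H);
            for (int i = 0; i < W * H; i++) in[i] = (unsigned char)(i * 31);

            /* schedule(dynamic, 8): hand out 8-row chunks on demand
             * instead of a fixed static split. */
            #pragma omp parallel for schedule(dynamic, 8)
            for (int y = 1; y < H - 1; y++)
                for (int x = 1; x < W - 1; x++) {
                    int s = 0;                    /* 3x3 box blur */
                    for (int dy = -1; dy <= 1; dy++)
                        for (int dx = -1; dx <= 1; dx++)
                            s += in[(y + dy) * W + (x + dx)];
                    out[y * W + x] = (unsigned char)(s / 9);
                }

            printf("center pixel = %d (max threads: %d)\n",
                   out[(H / 2) * W + W / 2], omp_get_max_threads());
            free(in); free(out);
            return 0;
        }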

  5. Memory Access Behavior Analysis of NUMA-Based Shared Memory Programs

    Directory of Open Access Journals (Sweden)

    Jie Tao

    2002-01-01

    Full Text Available Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.

  6. Highly flexible nearest-neighbor-search associative memory with integrated k nearest neighbor classifier, configurable parallelism and dual-storage space

    Science.gov (United States)

    An, Fengwei; Mihara, Keisuke; Yamasaki, Shogo; Chen, Lei; Jürgen Mattausch, Hans

    2016-04-01

    VLSI implementations are often applied to overcome the high computational cost of pattern matching but usually have low flexibility for satisfying different target applications. In this paper, a digital word-parallel associative memory architecture for k-nearest-neighbor (KNN) search, one of the most basic algorithms in pattern recognition, is reported, applying the squared Euclidean distance measure. The reported architecture features reconfigurable parallelism, a dual-storage space to achieve a flexible number of reference vectors, and a dedicated majority-vote circuit. Programmable switching circuits, located between vector components, enable scalability of the search parallelism by configuring the reference feature-vector dimensionality. A pipelined storage with dual static-random-access-memory (SRAM) cells for each unit and an intermediate winner control circuit are designed to extend the applicability by improving the flexibility of the reference storage. A test chip in 180 nm CMOS technology, which has 32 rows, 4 elements in each row and 2-parallel 8-bit dual-components in each element, consumes altogether 61.4 mW, and in particular only 11.9 mW during the reconfigurable KNN classification (at 45.58 MHz and 1.8 V).
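
    A software analogue of the chip's function may clarify the algorithm: squared Euclidean distances from a query to all reference vectors (computed word-parallel in the hardware), followed by a majority vote over the k nearest labels. The C sketch below uses invented data and dimensions:

        /* Software analogue of the KNN associative-memory search:
         * squared Euclidean distance to all references, then a
         * majority vote over the k nearest labels. Toy data. */
        #include <stdio.h>

        #define NREF 8   /* reference vectors (chip rows) */
        #define DIM  4   /* vector components             */
        #define K    3
        #define NCLS 2

        int main(void) {
            int ref[NREF][DIM] = {
                {1,1,1,1},{2,1,1,2},{1,2,2,1},{2,2,2,2},
                {8,9,8,9},{9,8,9,8},{8,8,9,9},{9,9,8,8}
            };
            int label[NREF] = {0,0,0,0,1,1,1,1};
            int query[DIM]  = {2,2,1,1};

            long dist[NREF];
            for (int r = 0; r < NREF; r++) {   /* word-parallel on chip */
                long d = 0;
                for (int c = 0; c < DIM; c++) {
                    long t = ref[r][c] - query[c];
                    d += t * t;                /* squared Euclidean metric */
                }
                dist[r] = d;
            }

            int votes[NCLS] = {0};
            for (int k = 0; k < K; k++) {      /* pick k nearest winners */
                int best = -1;
                for (int r = 0; r < NREF; r++)
                    if (dist[r] >= 0 && (best < 0 || dist[r] < dist[best]))
                        best = r;
                votes[label[best]]++;          /* majority-vote circuit */
                dist[best] = -1;               /* remove this winner */
            }
            printf("query classified as class %d\n", votes[1] > votes[0]);
            return 0;
        }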

  7. Acceleration of the Geostatistical Software Library (GSLIB) by code optimization and hybrid parallel programming

    Science.gov (United States)

    Peredo, Oscar; Ortiz, Julián M.; Herrero, José R.

    2015-12-01

    The Geostatistical Software Library (GSLIB) has been used in the geostatistical community for more than thirty years. It was designed as a bundle of sequential Fortran codes, and today it is still in use by many practitioners and researchers. Despite its widespread use, few attempts have been reported to bring this package into the multi-core era. Using all CPU resources, GSLIB algorithms can handle large datasets and grids, where tasks are compute- and memory-intensive applications. In this work, a methodology is presented to accelerate GSLIB applications using code optimization and hybrid parallel processing, specifically for compute-intensive applications. Minimal code modifications are added, decreasing as much as possible the elapsed execution time of the studied routines. If multi-core processing is available, the user can activate OpenMP directives to speed up the execution using all resources of the CPU. If multi-node processing is available, the execution is enhanced using MPI messages between the compute nodes. Four case studies are presented: experimental variogram calculation, kriging estimation, and sequential Gaussian and indicator simulation. For each application, three scenarios (small, large and extra large) are tested using a desktop environment with 4 CPU cores and a multi-node server with 128 CPU nodes. Elapsed times, speedup and efficiency results are shown.
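
    The hybrid structure described, MPI between compute nodes and OpenMP directives within each node, follows a standard skeleton. The sketch below is a generic stand-in with a placeholder workload, not a GSLIB routine; the build command (mpicc -fopenmp) is an assumption about the toolchain:

        /* Bare-bones hybrid MPI+OpenMP skeleton: MPI ranks split the
         * problem across nodes, OpenMP threads split each rank's share
         * across cores. The "work" is a placeholder, not a GSLIB kernel. */
        #include <stdio.h>
        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const long n = 1000000;                /* global work items */
            long lo = rank * n / size, hi = (rank + 1) * n / size;

            double local = 0.0;
            /* Each rank's slice is further split across its cores. */
            #pragma omp parallel for reduction(+:local)
            for (long i = lo; i < hi; i++)
                local += 1.0 / (1.0 + (double)i);  /* placeholder work */

            double global = 0.0;                   /* combine node results */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0)
                printf("sum = %f (%d ranks x %d threads)\n",
                       global, size, omp_get_max_threads());
            MPI_Finalize();
            return 0;
        }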

  8. What is "the patient perspective" in patient engagement programs? Implicit logics and parallels to feminist theories.

    Science.gov (United States)

    Rowland, Paula; McMillan, Sarah; McGillicuddy, Patti; Richards, Joy

    2017-01-01

    Public and patient involvement (PPI) in health care may refer to many different processes, ranging from participating in decision-making about one's own care to participating in health services research, health policy development, or organizational reforms. Across these many forms of public and patient involvement, the conceptual and theoretical underpinnings remain poorly articulated. Instead, most public and patient involvement programs rely on policy initiatives as their conceptual frameworks. This lack of conceptual clarity contributes to dilemmas of program design, implementation, and evaluation. This study contributes to the development of theoretical understandings of public and patient involvement. In particular, we focus on the deployment of patient engagement programs within health service organizations. To develop a deeper understanding of the conceptual underpinnings of these programs, we examined the concept of "the patient perspective" as used by patient engagement practitioners and participants. Specifically, we focused on the way this phrase was used in the singular: "the" patient perspective or "the" patient voice. From qualitative analysis of interviews with 20 patient advisers and 6 staff members within a large urban health network in Canada, we argue that "the patient perspective" is referred to as a particular kind of situated knowledge, specifically an embodied knowledge of vulnerability. We draw parallels between this logic of patient perspective and the logic of early feminist theory, including the concepts of standpoint theory and strong objectivity. We suggest that champions of patient engagement may learn much from the way feminist theorists have constructed their arguments and addressed critique.

  9. Transactional Memory

    CERN Document Server

    Harris, Tim; Rajwar, Ravi

    2010-01-01

    The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically - either it completes successfully and…
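
    As a concrete illustration, a compiler with transactional memory support (for example GCC with -fgnu-tm) lets shared updates be wrapped in an atomic block instead of a lock. The bank-transfer toy below is a minimal sketch of the programming model, not an excerpt from the book:

        /* Minimal transactional-memory example: concurrent threads move
         * money between accounts atomically, with no explicit locks.
         * Assumes a TM-capable compiler (e.g. gcc -fgnu-tm -pthread). */
        #include <stdio.h>
        #include <pthread.h>

        static long acct[2] = { 1000, 1000 };

        static void *worker(void *arg) {
            int from = (int)(long)arg, to = 1 - from;
            for (int i = 0; i < 100000; i++) {
                /* Reads and writes inside the block execute atomically:
                 * other threads never observe a partial transfer. */
                __transaction_atomic {
                    acct[from] -= 1;
                    acct[to]   += 1;
                }
            }
            return NULL;
        }

        int main(void) {
            pthread_t t0, t1;
            pthread_create(&t0, NULL, worker, (void *)0L);
            pthread_create(&t1, NULL, worker, (void *)1L);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);
            /* The total is conserved even under concurrent updates. */
            printf("total = %ld\n", acct[0] + acct[1]);
            return 0;
        }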

  10. A three-year follow-up of older adult participants in a memory-skills training program.

    Science.gov (United States)

    Scogin, F; Bienias, J L

    1988-12-01

    This study examined the long-term effects of participation in a self-taught memory training program. In all, 27 memory training and 13 nontraining participants were assessed at approximately 3-year follow-ups. Assessment of these groups prior to the introduction of training had revealed nonsignificant differences in memory performance but marked differences in level of memory complaints, with training participants evidencing higher levels of complaints. The current assessment again demonstrated overall nonsignificant differences in memory performance but significant differences in memory complaints between the two groups. More specifically, the training group evidenced significant decreases in memory performance over the 3-year interval, but no significant changes in memory complaints were observed for either group. Thus, memory training appeared to have little long-term effect on memory functioning. Future research should explore long-term maintenance strategies in memory training with older adults.

  11. Television and memory: history programming and contemporary identities

    Directory of Open Access Journals (Sweden)

    Erin Bell

    2011-05-01

    Full Text Available

    Abstract: This article considers recent UK history programming as a lens through which to contemplate the extent to which TV offers the potential for an audience to reflect on their personal past and present identity: ethnic, religious, regional or familial, in a wider public context, whilst also shaping aspects of personal and familial memory to be presented on screen as public memory. Although, as Bill Nichols (156) asserted in the early 1990s, subjectivity and identification are less frequently explored in documentaries than in fiction, I will also consider the extent to which some recent factual programmes on British television have succeeded in doing so, and also viewers' responses to them.

    Résumé: This article offers a rereading of programming policies in the recent history of British television. These policies provide an opportunity to analyse the extent to which television offers its viewers the possibility of reflecting, in a wider context, on their own past and their identity…

  12. Lateralized odor preference training in rat pups reveals an enhanced network response in anterior piriform cortex to olfactory input that parallels extended memory.

    Science.gov (United States)

    Fontaine, Christine J; Harley, Carolyn W; Yuan, Qi

    2013-09-18

    The present study examines synaptic plasticity in the anterior piriform cortex (aPC) using ex vivo slices from rat pups given lateralized odor preference training. In the early odor preference learning model, a brief 10 min training session yields 24 h memory, while four daily sessions yield 48 h memory. Odor preference memory can be lateralized through naris occlusion as the anterior commissure is not yet functional. AMPA receptor-mediated postsynaptic responses in the aPC to lateral olfactory tract input, shown to be enhanced at 24 h, are no longer enhanced 48 h after a single training session. Following four spaced lateralized trials, the AMPA receptor-mediated fEPSP is enhanced in the trained aPC at 48 h. Calcium imaging of aPC pyramidal cells within 48 h revealed decreased firing thresholds in the pyramidal cell network. Thus multiday odor preference training induced increased odor input responsiveness in previously weakly activated aPC cells. These results support the hypothesis that increased synaptic strength in olfactory input networks mediates odor preference memory. The increase in aPC network activation parallels behavioral memory.

  13. DOE SBIR Phase-1 Report on Hybrid CPU-GPU Parallel Development of the Eulerian-Lagrangian Barracuda Multiphase Program

    Energy Technology Data Exchange (ETDEWEB)

    Dr. Dale M. Snider

    2011-02-28

    This report gives the results from the Phase-1 work on demonstrating greater than 10x speedup of the Barracuda computer program using parallel methods and GPU processors (general-purpose graphics processing units). Phase-1 demonstrated a 12x speedup on a typical Barracuda function using the GPU processor. The problem test case used about 5 million particles and 250,000 Eulerian grid cells. The relative speedup, compared to a single CPU, increases with the number of particles, giving greater than 12x speedup. Phase-1 work identified the data structure modifications needed to give good parallel performance while keeping a friendly environment for new physics development and code maintenance. The implementation of data structure changes will be in Phase-2. Phase-1 laid the groundwork for the complete parallelization of Barracuda in Phase-2, with the caveat that the parallel programming practices implemented in Phase-1 already give immediate speedup in the current serial Barracuda code. The Phase-1 tasks were completed successfully, laying the framework for Phase-2. The detailed results of Phase-1 are within this document. In general, the speedup of one function would be expected to be higher than the speedup of the entire code because of I/O functions and communication between the algorithms. However, because one of the most difficult Barracuda algorithms was parallelized in Phase-1, and because advanced parallelization methods and proposed parallelization optimization techniques identified in Phase-1 will be used in Phase-2, an overall Barracuda code speedup (relative to a single CPU) is expected to be greater than 10x. This means that a job which takes 30 days to complete will be done in 3 days. Tasks completed in Phase-1 are: Task 1: Profile the entire Barracuda code and select which subroutines are to be parallelized (See Section Choosing a Function to Accelerate) Task 2: Select a GPU consultant company and

  14. The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

    Directory of Open Access Journals (Sweden)

    Matthew O'keefe

    1995-01-01

    Full Text Available Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.

  15. A Review on Large Scale Graph Processing Using Big Data Based Parallel Programming Models

    Directory of Open Access Journals (Sweden)

    Anuraj Mohan

    2017-02-01

    Full Text Available Processing big graphs has become an increasingly essential activity in various fields like engineering, business intelligence and computer science. Social networks and search engines usually generate large graphs, which demand sophisticated techniques for social network analysis and web structure mining. The latest trends in graph processing tend towards using Big Data platforms for parallel graph analytics. MapReduce has emerged as a Big Data based programming model for processing massively large datasets. Apache Giraph, an open-source implementation of Google Pregel based on the Bulk Synchronous Parallel (BSP) model, is used for graph analytics in social networks like Facebook. The proposed work investigates the algorithmic effects of the MapReduce and BSP models on graph problems. The triangle counting problem in graphs is considered as a benchmark, and evaluations are made on the basis of computation time on the same cluster, scalability in relation to graph and cluster size, resource utilization, and the structure of the graph.
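
    For reference, the triangle-counting benchmark itself is simple; the interesting part in the surveyed work is distributing it. A toy sequential C version over an adjacency matrix (an invented five-node graph) shows what the MapReduce and BSP implementations must compute:

        /* The triangle-counting benchmark in its simplest form: count
         * node triples (i,j,k), i<j<k, with all three edges present.
         * A toy reference version; the surveyed systems distribute this
         * over MapReduce or BSP (Pregel/Giraph) clusters. */
        #include <stdio.h>

        #define V 5

        int main(void) {
            /* Undirected toy graph: triangles (0,1,2) and (1,2,3). */
            int adj[V][V] = {0};
            int edges[][2] = { {0,1},{0,2},{1,2},{1,3},{2,3},{3,4} };
            for (unsigned e = 0; e < sizeof edges / sizeof edges[0]; e++) {
                adj[edges[e][0]][edges[e][1]] = 1;
                adj[edges[e][1]][edges[e][0]] = 1;
            }

            long count = 0;
            for (int i = 0; i < V; i++)
                for (int j = i + 1; j < V; j++)
                    if (adj[i][j])
                        for (int k = j + 1; k < V; k++)
                            if (adj[j][k] && adj[i][k])
                                count++;   /* ordered triple: counted once */
            printf("triangles: %ld\n", count);  /* prints 2 */
            return 0;
        }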

  16. Visuomotor memory in elderly: effect of a physical exercise program

    Directory of Open Access Journals (Sweden)

    João Silva

    2014-12-01

    Full Text Available Memory, namely visuomotor memory, is one of the most essential cognitive functions in elder’s life. Among others, regular exercise seems to be an important factor in counteracting age-related-cognitive skills changes and thus prevent memory loss. However, in spite of the importance of visuomotor memory, the results of the scarce studies concerning the influence of exercise on this capacity are contradictory. The aim of this study was to investigate the effect of physical exercise (PE in visuomotor memory (VMM of elderly adults in function of gender and age. VMM (time spent in performing the test and errors during the execution of 74 subjects aged 60-90 years, being 36 practitioners of PE (P - mean age of 70.22 ± 0.90 years and 38 non-practitioners (NP - mean age of 68.26 ± 1.12 years were assessed by VMM Test. The results showed that: a P presented a better performance in the time of performing the test and in the number of errors committed compared to NP; b Gender and age did not influence the VMM performance. Data suggest that PE seems to have positive effect in the VMM, independently of gender and age.

  17. Research on Task Parallel Programming Model

    Institute of Scientific and Technical Information of China (English)

    王蕾; 崔慧敏; 陈莉; 冯晓兵

    2013-01-01

    Task parallel programming models are widely used parallel programming models on multi-core platforms, intended to simplify parallel programming and improve the utilization of multiple cores. This paper introduces the essential programming interfaces and supporting mechanisms used in task parallel programming models, and discusses issues and the latest achievements from three perspectives: parallelism expression, data management and task scheduling. Finally, future trends in this area are discussed.
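
    A minimal example of the programming interface such models expose, here using OpenMP tasks (one widely available task-parallel model; the cutoff value is an arbitrary choice):

        /* Task-parallel recursion with OpenMP tasks: each recursive call
         * becomes a task and the runtime scheduler maps tasks to cores. */
        #include <stdio.h>
        #include <omp.h>

        static long fib(int n) {
            if (n < 25)                   /* cutoff: avoid tiny tasks */
                return n < 2 ? n : fib(n - 1) + fib(n - 2);
            long a, b;
            #pragma omp task shared(a)    /* child task for fib(n-1) */
            a = fib(n - 1);
            #pragma omp task shared(b)    /* child task for fib(n-2) */
            b = fib(n - 2);
            #pragma omp taskwait          /* join before combining */
            return a + b;
        }

        int main(void) {
            long r;
            #pragma omp parallel
            #pragma omp single            /* one thread seeds the task tree */
            r = fib(32);
            printf("fib(32) = %ld\n", r);
            return 0;
        }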

  18. Structured Parallel Programming: How Informatics Can Help Overcome the Software Dilemma

    Directory of Open Access Journals (Sweden)

    Helmar Burkhart

    1996-01-01

    Full Text Available The state-of-the-art programming of parallel computers is far from being successful. The main challenge today is, therefore, the development of techniques and tools that improve programmers' productivity. Programmability, portability, and reusability are key issues to be solved. In this article we shall report about our ongoing efforts in this direction. After a short discussion of the software dilemma found today, we shall present the Basel approach. We shall summarize our algorithm description methodology and discuss the basic concepts of the proposed skeleton language. An algorithmic example and comments on implementation aspects will explain our work in more detail. We shall summarize the current state of the implementation and conclude with a discussion of related work.
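
    The core idea of a skeleton can be shown in miniature: the coordination pattern is written once and reused with different sequential pieces. The C sketch below is an invented "map" skeleton in this spirit, not the Basel skeleton language itself:

        /* A minimal "map" skeleton: the parallel coordination pattern
         * lives in one reusable routine, the application programmer
         * supplies only sequential worker functions. Invented example. */
        #include <stdio.h>

        static void map_skeleton(double *v, int n, double (*f)(double)) {
            #pragma omp parallel for   /* parallelism hidden in skeleton */
            for (int i = 0; i < n; i++)
                v[i] = f(v[i]);
        }

        static double square(double x) { return x * x; }

        int main(void) {
            double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
            map_skeleton(v, 8, square);
            for (int i = 0; i < 8; i++) printf("%g ", v[i]);
            printf("\n");
            return 0;
        }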

  19. Adaptive Representations for Improving Evolvability, Parameter Control, and Parallelization of Gene Expression Programming

    Directory of Open Access Journals (Sweden)

    Nigel P. A. Browne

    2010-01-01

    Full Text Available Gene Expression Programming (GEP) is a genetic algorithm that evolves linear chromosomes encoding nonlinear (tree-like) structures. In the original GEP algorithm, the genome size is problem specific and is determined through trial and error. In this work, a method for adaptive control of the genome size is presented. The approach introduces mutation, transposition, and recombination operators that enable a population of heterogeneously structured chromosomes, something the original GEP algorithm does not support. This permits crossbreeding between normally incompatible individuals and speciation within a population, increases the evolvability of the representations, and enhances parallel GEP. To test our approach, an assortment of problems was used, including symbolic regression, classification, and parameter optimization. Our experimental results show that our approach provides a solution for the problem of self-adaptive control of the genome size of GEP's representation.

  20. Impaired hippocampal acetylcholine release parallels spatial memory deficits in Tg2576 mice subjected to basal forebrain cholinergic degeneration

    DEFF Research Database (Denmark)

    Laursen, Bettina; Mørk, Arne; Plath, Niels

    2013-01-01

    (BFCD) in 3-month-old male Tg2576 mice to co-express cholinergic degeneration with Aβ overexpression, as these characteristics constitute key hallmarks of AD. At 9 months, SAP-lesioned Tg2576 mice were cognitively impaired in two spatial paradigms addressing working memory and mid- to long-term memory…

  1. Study on Parallel Computing

    Institute of Scientific and Technical Information of China (English)

    Guo-Liang Chen; Guang-Zhong Sun; Yun-Quan Zhang; Ze-Yao Mo

    2006-01-01

    In this paper, we present a general survey of parallel computing. The main contents include the parallel computer system, which is the hardware platform of parallel computing; the parallel algorithm, which is its theoretical base; and parallel programming, which is its software support. After that, we also introduce some parallel applications and enabling technologies. We argue that parallel computing research should form an integrated methodology of "architecture - algorithm - programming - application". Only in this way can parallel computing research develop continuously and remain realistic.

  2. Developing Memory Clinics in Primary Care: An Evidence-Based Interprofessional Program of Continuing Professional Development

    Science.gov (United States)

    Lee, Linda; Weston, W. Wayne; Hillier, Loretta M.

    2013-01-01

    Introduction: Primary care is challenged to meet the needs of patients with dementia. A training program was developed to increase capacity for dementia care through the development of Family Health Team (FHT)-based interprofessional memory clinics. The interprofessional training program consisted of a 2-day workshop, 1-day observership, and 2-day…

  3. Combined shared and distributed memory ab-initio computations of molecular-hydrogen systems in the correlated state: Process pool solution and two-level parallelism

    Science.gov (United States)

    Biborski, Andrzej; Kądzielawa, Andrzej P.; Spałek, Józef

    2015-12-01

    An efficient computational scheme devised for investigations of ground-state properties of electronically correlated systems is presented. As an example, an (H2)n chain is considered, with the long-range electron-electron interactions taken into account. The implemented procedure covers: (i) single-particle Wannier wave-function basis construction in the correlated state, (ii) microscopic parameter calculation, and (iii) ground-state energy optimization. The optimization loop is based on a highly effective process-pool solution, a specific root-workers approach. Hierarchical, two-level parallelism was applied: both shared (by use of Open Multi-Processing) and distributed (by use of Message Passing Interface) memory models were utilized. We discuss in detail how this approach results in a substantial increase in calculation speed, reaching a factor of 300 for the fully parallelized solution. The scheme elaborated here reflects the situation in which the most demanding task is the single-particle basis optimization.
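
    The root-workers process-pool pattern mentioned above has a standard MPI skeleton: the root hands out work items on demand and collects results, so faster workers automatically receive more items. The sketch below uses placeholder tags, item counts and work, not the paper's optimization loop:

        /* Skeletal root-workers process pool in MPI. Rank 0 dispatches
         * items on demand and gathers results; the work is a stand-in. */
        #include <stdio.h>
        #include <mpi.h>

        #define NITEMS   20
        #define TAG_WORK  1
        #define TAG_STOP  2

        static double do_work(int item) { return 0.5 * item; }

        int main(int argc, char **argv) {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {                    /* root: dispatch on demand */
                int next = 0, active = 0;
                double sum = 0.0;
                for (int w = 1; w < size; w++) {     /* seed every worker */
                    if (next < NITEMS) {
                        MPI_Send(&next, 1, MPI_INT, w, TAG_WORK,
                                 MPI_COMM_WORLD);
                        next++; active++;
                    } else {
                        MPI_Send(&next, 1, MPI_INT, w, TAG_STOP,
                                 MPI_COMM_WORLD);
                    }
                }
                while (active > 0) {       /* collect, then refill or stop */
                    double res; MPI_Status st;
                    MPI_Recv(&res, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                             MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                    sum += res;
                    if (next < NITEMS) {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                                 MPI_COMM_WORLD);
                        next++;
                    } else {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                                 MPI_COMM_WORLD);
                        active--;
                    }
                }
                printf("sum of results = %f\n", sum);
            } else {                            /* worker: loop until stopped */
                for (;;) {
                    int item; MPI_Status st;
                    MPI_Recv(&item, 1, MPI_INT, 0, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &st);
                    if (st.MPI_TAG == TAG_STOP) break;
                    double res = do_work(item);
                    MPI_Send(&res, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
                }
            }
            MPI_Finalize();
            return 0;
        }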

  4. SKIRT: Hybrid parallelization of radiative transfer simulations

    Science.gov (United States)

    Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.

    2017-07-01

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
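
    The lock-free style of thread synchronization described, atomic operations instead of high-level locking, can be illustrated with C11 atomics; SKIRT itself is C++ and far more elaborate, so the binning toy below is only a sketch of the idea:

        /* Lock-free accumulation with C11 atomics: threads update shared
         * counters with atomic operations, no mutexes. Toy sketch only. */
        #include <stdio.h>
        #include <pthread.h>
        #include <stdatomic.h>

        #define NTHREADS 4
        #define NBINS 8

        static _Atomic long hist[NBINS];       /* shared detector bins */

        static void *shoot_photons(void *arg) {
            unsigned seed = (unsigned)(long)arg;
            for (int i = 0; i < 100000; i++) {
                seed = seed * 1664525u + 1013904223u;   /* toy RNG */
                int bin = (seed >> 24) % NBINS;
                atomic_fetch_add(&hist[bin], 1);  /* atomic, lock-free */
            }
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, shoot_photons, (void *)(i + 1));
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
            long total = 0;
            for (int b = 0; b < NBINS; b++) total += hist[b];
            printf("total = %ld (expected %d)\n", total, NTHREADS * 100000);
            return 0;
        }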

  5. Distributed-memory matrix computations

    DEFF Research Database (Denmark)

    Balle, Susanne Mølleskov

    1995-01-01

    in these algorithms is that many scientific applications rely heavily on the performance of the involved dense linear algebra building blocks. Even though we consider the distributed-memory as well as the shared-memory programming paradigm, the major part of the thesis is dedicated to distributed-memory architectures. We emphasize distributed-memory massively parallel computers - such as the Connection Machines model CM-200 and model CM-5/CM-5E - available to us at UNI-C and at Thinking Machines Corporation. The CM-200 was at the time this project started one of the few existing massively parallel computers… performance can we expect to achieve? Why? 2. Solving systems of linear equations using a Strassen-type matrix-inversion algorithm. A good way to solve systems of linear equations on massively parallel computers? 3. Aspects of computing the singular value decomposition on the Connection Machine CM-5/CM-5E…

  6. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven

    2008-01-01

    The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chip. This means that parallel processing is required in application areas that traditionally have not used parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately available parallelism. It is difficult to further extract parallelism since the application has small data sets and parallelization overhead is relatively high. There is also a fair amount of load imbalance which is made worse by a non-uniform memory latency. Even so, we show that with some tuning relative…

  7. CAPTURE OF EVENTS MIDI IN PARALLEL WITH FPGAs'

    Directory of Open Access Journals (Sweden)

    M. Peña

    2004-12-01

    Full Text Available The project consists of designing, in an FPGA system, a special dynamic memory MCS-S (MIDI Capture System - Segmented) to capture, in real time and in parallel form, musical data coming from a group of instruments while they play in an orchestra, as well as to obtain their score. Inside the system, each memory segment stores the notes corresponding to one instrument. The control system automatically prepares the necessary memory cells for each instrument and inserts new notes into each segment in parallel form. The electronic components of this system are programmed in VHDL, for later implementation on the FPGA.

  8. Multilevel Resistance Programming in Conductive Bridge Resistive Memory

    Science.gov (United States)

    Mahalanabis, Debayan

    This work focuses on the existence of multiple resistance states in a type of emerging non-volatile resistive memory device known commonly as the Programmable Metallization Cell (PMC) or Conductive Bridge Random Access Memory (CBRAM), which can be important for applications such as multi-bit memory as well as non-volatile logic and neuromorphic computing. First, experimental data from small-signal, quasi-static and pulsed-mode electrical characterization of such devices are presented, which clearly demonstrate the inherent multi-level resistance programmability of CBRAM devices. A physics-based analytical CBRAM compact model is then presented which simulates the ion-transport dynamics and the filamentary growth mechanism that causes resistance change in such devices. Simulation results from the model are fitted to experimental dynamic resistance switching characteristics. The model, designed in the Verilog-A language, is computationally efficient and can be integrated with industry-standard circuit simulation tools for design and analysis of hybrid circuits involving both CMOS and CBRAM devices. Three main circuit applications for CBRAM devices are explored in this work. Firstly, the susceptibility of CBRAM memory arrays to single-event-induced upsets is analyzed via compact model simulation and experimental heavy-ion testing data that show the possibility of both high-resistance-to-low-resistance and low-resistance-to-high-resistance transitions due to ion strikes. Next, a non-volatile sense-amplifier-based flip-flop architecture is proposed which can make leakage power consumption negligible by allowing complete shutdown of the power supply while retaining its output data in CBRAM devices. Reliability and energy consumption of the flip-flop circuit for different CBRAM low-resistance levels and supply voltage values are analyzed and compared to CMOS designs. Possible extension of this architecture for threshold logic function computation using the CBRAM devices as re

  9. Manipulation of stimulus onset delay in reading: evidence for parallel programming of saccades.

    Science.gov (United States)

    Morrison, R E

    1984-10-01

    On-line eye movement recording of 12 subjects who read short stories on a cathode ray tube enabled a test of direct control and preprogramming models of eye movements in reading. Contingent upon eye position, a mask was displayed in place of the letters in central vision after each saccade, delaying the onset of the stimulus in each eye fixation. The duration of the delay was manipulated in fixed or randomized blocks. Although the length of the delay strongly affected the duration of the fixations, there was no difference due to the conditions of delay manipulation, indicating that fixation duration is under direct control. However, not all fixations were lengthened by the period of the delay. Some ended while the mask was still present, suggesting they had been preprogrammed. But these "anticipation" eye movements could not have been completely determined before the fixation was processed because their fixation durations and saccade lengths were affected by the spatial extent of the mask, which varied randomly. Neither preprogramming nor existing serial direct control models of eye guidance can adequately account for these data. Instead, a model with direct control and parallel programming of saccades is proposed to explain the data and eye movements in reading in general.

  10. A pattern recognition system for prostate mass spectra discrimination based on the CUDA parallel programming model

    Science.gov (United States)

    Kostopoulos, Spiros; Glotsos, Dimitris; Sidiropoulos, Konstantinos; Asvestas, Pantelis; Cavouras, Dionisis; Kalatzis, Ioannis

    2014-03-01

    The aim of the present study was to implement a pattern recognition system for the discrimination of healthy from malignant prostate tumors from proteomic Mass Spectroscopy (MS) samples and to identify m/z intervals of potential biomarkers associated with prostate cancer. One hundred and six MS-spectra were studied in total; sixty-three spectra corresponded to healthy cases and the remaining forty-three to malignant cases (PSA 10). The MS-spectra are publicly available from the NCI Clinical Proteomics Database. The pre-processing comprised the following steps: denoising, normalization, peak extraction and peak alignment. Due to the enormous number of features arising from the MS-spectra as informative peaks, and in order to secure optimum system design, the classification task was performed by programming in parallel the multiprocessors of an nVIDIA GPU card, using the CUDA framework. The proposed system achieved 98.1% accuracy. The identified m/z intervals displayed significant statistical differences between the two classes and were found to possess adequate discriminatory power in characterizing prostate samples when employed in the design of the classification system. Those intervals should be further investigated since they might lead to the identification of potential new biomarkers for prostate cancer.

  11. Pattern-Driven Automatic Parallelization

    Directory of Open Access Journals (Sweden)

    Christoph W. Kessler

    1996-01-01

    Full Text Available This article describes a knowledge-based system for automatic parallelization of a wide class of sequential numerical codes operating on vectors and dense matrices, and for execution on distributed memory message-passing multiprocessors. Its main feature is a fast and powerful pattern recognition tool that locally identifies frequently occurring computations and programming concepts in the source code. This tool also works for dusty deck codes that have been "encrypted" by former machine-specific code transformations. Successful pattern recognition guides sophisticated code transformations including local algorithm replacement such that the parallelized code need not emerge from the sequential program structure by just parallelizing the loops. It allows access to an expert's knowledge on useful parallel algorithms, available machine-specific library routines, and powerful program transformations. The partially restored program semantics also supports local array alignment, distribution, and redistribution, and allows for faster and more exact prediction of the performance of the parallelized target code than is usually possible.

  12. Memory optimized parallel LDPC decoder architecture design on GPU

    Institute of Scientific and Technical Information of China (English)

    葛帅; 刘荣科; 侯毅

    2013-01-01

    An optimized decoder architecture was proposed for the low-density parity-check (LDPC) layered decoding algorithm based on Nvidia's Fermi graphics processing unit (GPU). In accordance with the parallelism characteristics of the GPU hardware structure, inter-frame and intra-layer parallel processing were adopted to fully utilize the streaming multiprocessor (SM) resources and mitigate the limited decoding parallelism of the layered decoding algorithm. Secondly, by storing the parity-check matrix compressed in on-chip constant memory and by coalescing accesses to the exchanged decoding information in off-chip global memory, the memory access latency was reduced and the decoding throughput thereby improved. Simulation results show that a 14.9x to 34.8x speedup in decoding throughput is obtained by using multi-frame processing and memory access optimization on the GPU platform.
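
    The compressed storage of a sparse parity-check matrix generalizes beyond GPUs: H is typically kept as index lists rather than a dense bit matrix. A compressed-sparse-row (CSR) toy in C, with an invented 3x6 matrix, shows the layout and how parity checks are evaluated from it:

        /* CSR storage for a sparse parity-check matrix: only the column
         * indices of the 1-entries are kept, the kind of compressed
         * layout that fits small on-chip/constant memories. Toy H. */
        #include <stdio.h>

        int main(void) {
            /* H = [1 1 0 1 0 0]
             *     [0 1 1 0 1 0]
             *     [1 0 0 0 1 1]   (3 checks x 6 variable nodes) */
            int row_ptr[] = {0, 3, 6, 9};          /* start of each row */
            int col_idx[] = {0,1,3, 1,2,4, 0,4,5}; /* variables per check */

            int bits[6] = {1, 0, 1, 0, 1, 0};      /* received hard bits */
            for (int c = 0; c < 3; c++) {          /* evaluate each check */
                int parity = 0;
                for (int k = row_ptr[c]; k < row_ptr[c + 1]; k++)
                    parity ^= bits[col_idx[k]];
                printf("check %d: %s\n", c, parity ? "FAILED" : "ok");
            }
            return 0;
        }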

  13. Identifying Inter-task Communication in Shared Memory Programming Models

    DEFF Research Database (Denmark)

    Larsen, Per; Karlsson, Sven; Madsen, Jan

    2009-01-01

    Modern computers often use multi-core architectures, ranging from clusters of homogeneous cores for high-performance computing to the heterogeneous architectures typically found in embedded systems. To efficiently program such architectures, it is important to be able to partition and map programs on

  14. A new metric enabling an exact hypergraph model for the communication volume in distributed-memory parallel applications

    NARCIS (Netherlands)

    Fortmeier, O.; Bücker, H.M.; Fagginger Auer, B.O.; Bisseling, R.H.

    2013-01-01

    A hypergraph model for mapping applications with an all-neighbor communication pattern to distributed-memory computers is proposed, which originated in finite element triangulations. Rather than approximating the communication volume for linear algebra operations, this new model represents the commu

  15. SPSS and SAS programs for determining the number of components using parallel analysis and velicer's MAP test.

    Science.gov (United States)

    O'Connor, B P

    2000-08-01

    Popular statistical software packages do not have the proper procedures for determining the number of components in factor and principal components analyses. Parallel analysis and Velicer's minimum average partial (MAP) test are validated procedures, recommended widely by statisticians. However, many researchers continue to use alternative, simpler, but flawed procedures, such as the eigenvalues-greater-than-one rule. Use of the proper procedures might be increased if these procedures could be conducted within familiar software environments. This paper describes brief and efficient programs for using SPSS and SAS to conduct parallel analyses and the MAP test.

  16. A component analysis based on serial results analyzing performance of parallel iterative programs

    Energy Technology Data Exchange (ETDEWEB)

    Richman, S.C. [Dalhousie Univ. (Canada)

    1994-12-31

    This research is concerned with the parallel performance of iterative methods for solving large, sparse, nonsymmetric linear systems. Most of the iterative methods are first presented with their time costs and convergence rates examined intensively on sequential machines, and then adapted to parallel machines. The analysis of parallel iterative performance is more complicated than that of serial performance, since the former can be affected by many new factors, such as data communication schemes, the number of processors used, and ordering and mapping techniques. Although the author is able to summarize results from data obtained after examining certain cases by experiments, two questions remain: (1) How to explain the results obtained? (2) How to extend the results from the certain cases to general cases? To answer these two questions quantitatively, the author introduces a tool called component analysis based on serial results. This component analysis is introduced because the iterative methods consist mainly of several basic functions such as linked triads, inner products, and triangular solves, which have different intrinsic parallelisms and are suitable for different parallel techniques. The parallel performance of each iterative method is first expressed as a weighted sum of the parallel performance of the basic functions that are the components of the method. Then, one separately examines the performance of the basic functions and the weighting distributions of the iterative methods, from which two independent sets of information are obtained when solving a given problem. In this component approach, all the weightings require only serial costs, not parallel costs, and each iterative method for solving a given problem is represented by its unique weighting distribution. The information given by the basic functions is independent of the iterative method, while that given by the weightings is independent of parallel technique, parallel machine and number of processors.

  17. Parallel processing ITS

    Energy Technology Data Exchange (ETDEWEB)

    Fan, W.C.; Halbleib, J.A. Sr.

    1996-09-01

    This report provides a users' guide for parallel processing ITS on a UNIX workstation network, a shared-memory multiprocessor or a massively-parallel processor. The parallelized version of ITS is based on a master/slave model with message passing. Parallel issues such as random number generation, load balancing, and communication software are briefly discussed. Timing results for example problems are presented for demonstration purposes.

  18. Location pairs: a test coverage metric for shared-memory concurrent programs

    OpenAIRE

    Keremoğlu, M. Erkan; Taşıran, Serdar; Muslu, Kıvanç

    2012-01-01

    We present a coverage metric targeted at shared-memory concurrent programs: the Location Pairs (LP) coverage metric. The goals of this metric are (i) to measure how thoroughly a program has been tested from a concurrency standpoint, i.e., whether enough qualitatively different thread interleavings have been explored, and (ii) to guide testing towards unexplored concurrency scenarios. This metric was inspired by an access pattern known to lead to high-level concurrency errors in industrial sof...

  19. Memory-Level Parallelism and Processor Microarchitecture

    Institute of Scientific and Technical Information of China (English)

    谢伦国; 刘德峰

    2011-01-01

    As the gap between processor and memory performance increases, the performance loss due to long-latency memory accesses becomes a primary problem. Memory-level parallelism (MLP) improves performance by servicing multiple memory accesses concurrently. In this paper, the authors review the background of MLP, introduce the concept of MLP and its relation to processor performance models, and analyze the main factors that limit a processor's MLP. They then survey and compare in detail the various techniques for improving MLP, and finally summarize open issues and promising directions for further research.
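
    What MLP means at the source level can be hinted at with software prefetching: issuing loads for future elements so that several cache misses are outstanding at once instead of serializing. The sketch below assumes a GCC-compatible compiler for __builtin_prefetch and uses an invented gather pattern:

        /* Memory-level parallelism at the source level: prefetching a
         * few elements ahead keeps multiple long-latency accesses in
         * flight at once. Assumes GCC-style __builtin_prefetch. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N (1 << 22)
        #define AHEAD 16            /* distance: misses kept in flight */

        int main(void) {
            int *idx = malloc(N * sizeof *idx);
            double *data = malloc(N * sizeof *data);
            for (int i = 0; i < N; i++) {        /* pseudo-random gather */
                idx[i] = (i * 2654435761u) % N;
                data[i] = i * 0.001;
            }
            double sum = 0.0;
            for (int i = 0; i < N; i++) {
                if (i + AHEAD < N)               /* overlap future misses */
                    __builtin_prefetch(&data[idx[i + AHEAD]], 0, 0);
                sum += data[idx[i]];             /* dependent use */
            }
            printf("sum = %f\n", sum);
            free(idx); free(data);
            return 0;
        }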

  20. A hot hole-programmed and low-temperature-formed SONOS flash memory.

    Science.gov (United States)

    Chang, Yuan-Ming; Yang, Wen-Luh; Liu, Sheng-Hsien; Hsiao, Yu-Ping; Wu, Jia-Yo; Wu, Chi-Chang

    2013-07-31

    In this study, a high-performance TixZrySizO flash memory is demonstrated using a sol-gel spin-coating method and formed at a low annealing temperature. The high-efficiency charge storage layer is formed by depositing a well-mixed solution of titanium tetrachloride, silicon tetrachloride, and zirconium tetrachloride, followed by 60 s of annealing at 600°C. The flash memory exhibits a noteworthy hot-hole trapping characteristic and excellent electrical properties regarding memory window, program/erase speeds, and charge retention. At only 6-V operation, the program/erase speeds can be as fast as 120 μs/5.2 μs with a 2-V shift, and the memory window can be up to 8 V. The retention times are extrapolated to 10^6 s with only 5% (at 85°C) and 10% (at 125°C) charge loss. The barrier height of the TixZrySizO film is demonstrated to be 1.15 eV for hole trapping, through extraction of the Poole-Frenkel current. The excellent performance of the memory is attributed to the high density of trapping sites in the low-temperature-annealed, high-κ sol-gel film.
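
    For reference, barrier-height extraction of this kind typically relies on the textbook Poole-Frenkel emission law, shown here in a generic form (the paper's exact fitting procedure is not reproduced):

        J \propto E \exp\!\left[-\,\frac{q\left(\phi_B - \sqrt{qE/(\pi\varepsilon)}\right)}{k_B T}\right]

    Plotting ln(J/E) against sqrt(E) then gives a straight line from which the barrier height phi_B can be extracted, which is the general kind of fit implied above.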

  1. Logic, Design & Organization of PTVD-SHAM; A Parallel Time Varying & Data Super-helical Access Memory

    CERN Document Server

    Alipour, P B

    2007-01-01

    This paper encompasses a super helical memory system's design, 'Boolean logic & image-logic' as a theoretical concept of an invention-model to 'store time-data' in terms of anticipating the best memory location ever for data/time. A waterfall effect is deemed to assist the process of potential-difference output-switch into diverse logic states in quantum dot computational methods via utilizing coiled carbon nanotubes (CCNTs) and carbon nanotube field effect transistors (CNFETs). A 'quantum confinement' is thus derived for a flow of particles in a categorized quantum well substrate with a normalized capacitance rectifying high B-field flux into electromagnetic induction. Multi-access of coherent sequences of 'qubit addressing' is gained in any magnitude as pre-defined with 'no need for say e.g. memory address accessibility and sorting data based on traditional 'big O notation' here, asymptotically confined into singularity' whilst possessing a magnitude of 'infinity' for the orientation of array displaceme...

  2. Increased entorhinal-prefrontal theta synchronization parallels decreased entorhinal-hippocampal theta synchronization during learning and consolidation of associative memory.

    Directory of Open Access Journals (Sweden)

    Kaori Takehara-Nishiuchi

    2012-01-01

    Full Text Available Memories are thought to be encoded as a distributed representation in the neocortex. The medial prefrontal cortex (mPFC has been shown to support the expression of memories that initially depend on the hippocampus (HPC, yet the mechanisms by which the HPC and mPFC access the distributed representations in the neocortex are unknown. By measuring phase synchronization of local field potential (LFP oscillations, we found that learning initiated changes in neuronal communication of the HPC and mPFC with the lateral entorhinal cortex (LEC, an area that is connected with many other neocortical regions. LFPs were recorded simultaneously from the three brain regions while rats formed an association between an auditory stimulus (CS and eyelid stimulation (US in a trace eyeblink conditioning paradigm, as well as during retention one month following learning. Over the course of learning, theta oscillations in the LEC and mPFC became strongly synchronized following the presentation of the CS on trials in which rats exhibited a conditioned response (CR, and this strengthened synchronization was also observed during retention one month after learning. In contrast, CS-evoked theta synchronization between the LEC and HPC decreased with learning. Our results suggest that the communication between the LEC and mPFC is strengthened with learning whereas the communication between the LEC and HPC is concomitantly weakened, suggesting that enhanced LEC-mPFC communication may be a key process for theoretically-proposed neocortical reorganization accompanying encoding and consolidation of a memory.

  3. Parallel sorting algorithms

    CERN Document Server

    Akl, Selim G

    1985-01-01

    Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the

  4. On the design of BSB neural associative memories using semidefinite programming.

    Science.gov (United States)

    Park, J; Cho, H; Park, D

    1999-11-15

    This article is concerned with the reliable search for optimally performing BSB (brain-state-in-a-box) neural associative memories, given a set of prototype patterns to be stored as stable equilibrium points. By converting and/or modifying the nonlinear constraints of a known formulation for the synthesis of BSB-based associative memories into linear matrix inequalities, we recast the synthesis into semidefinite programming problems and solve them by recently developed interior-point methods. The validity of this approach is illustrated by a design example.

  5. Can We Efficiently Check Concurrent Programs Under Relaxed Memory Models in Maude?

    DEFF Research Database (Denmark)

    Arrahman, Yehia Abd; Andric, Marina; Beggiato, Alessandro

    2014-01-01

    Relaxed memory models offer suitable abstractions of the actual optimizations offered by multi-core architectures and by compilers of concurrent programming languages. Using such abstractions for verification purposes is challenging in part due to their inherent non-determinism, which contributes to the state space explosion. Several techniques have been proposed to mitigate those problems so as to make verification under relaxed memory models feasible. We discuss how to adopt some of those techniques in a Maude-based approach to language prototyping, and suggest the use of other techniques that have been…

  6. Expressing Coarse-Grain Dependencies Among Tasks in Shared Memory Programs

    DEFF Research Database (Denmark)

    Larsen, Per; Karlsson, Sven; Madsen, Jan

    2011-01-01

    Designers of embedded systems face tight constraints on resources, response time and cost. The ability to analyze embedded systems is essential to timely delivery of new designs. Many analysis techniques model parallel programs as task graphs. Task graphs capture the worst-case execution times of...

  7. ParaHaplo: A program package for haplotype-based whole-genome association study using parallel computing

    Directory of Open Access Journals (Sweden)

    Kamatani Naoyuki

    2009-10-01

    Full Text Available Background: Since more than a million single-nucleotide polymorphisms (SNPs) are analyzed in any given genome-wide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiple-comparison problems in GWAS, haplotype-based algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are time-consuming. Faster methods for calculating exact probabilities and executing permutation tests are required. Methods: We developed a set of computer programs for the parallel computation of accurate P-values in haplotype-based GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on the JPT and CHB samples of HapMap. Results: ParaHaplo can detect smaller differences between 2 populations than SNP-based GWAS. We also found that parallel-computing techniques made ParaHaplo 100-fold faster than a non-parallel version of the program. Conclusion: ParaHaplo is a useful tool in conducting haplotype-based GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computing - such as that used in ParaHaplo - will become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1

  8. ParaHaplo: A program package for haplotype-based whole-genome association study using parallel computing.

    Science.gov (United States)

    Misawa, Kazuharu; Kamatani, Naoyuki

    2009-10-21

    Since more than a million single-nucleotide polymorphisms (SNPs) are analyzed in any given genome-wide association study (GWAS), performing multiple comparisons can be problematic. To cope with multiple-comparison problems in GWAS, haplotype-based algorithms were developed to correct for multiple comparisons at multiple SNP loci in linkage disequilibrium. A permutation test can also control problems inherent in multiple testing; however, both the calculation of exact probability and the execution of permutation tests are time-consuming. Faster methods for calculating exact probabilities and executing permutation tests are required. We developed a set of computer programs for the parallel computation of accurate P-values in haplotype-based GWAS. Our program, ParaHaplo, is intended for workstation clusters using the Intel Message Passing Interface (MPI). We compared the performance of our algorithm to that of the regular permutation test on JPT and CHB of HapMap. ParaHaplo can detect smaller differences between 2 populations than SNP-based GWAS. We also found that parallel-computing techniques made ParaHaplo 100-fold faster than a non-parallel version of the program. ParaHaplo is a useful tool in conducting haplotype-based GWAS. Since the data sizes of such projects continue to increase, the use of fast computations with parallel computing--such as that used in ParaHaplo--will become increasingly important. The executable binaries and program sources of ParaHaplo are available at the following address: http://sourceforge.jp/projects/parallelgwas/?_sl=1.

  9. Scalable parallel programming for high performance seismic simulation on petascale heterogeneous supercomputers

    Science.gov (United States)

    Zhou, Jun

    The 1994 Northridge earthquake in Los Angeles, California, killed 57 people, injured over 8,700 and caused an estimated $20 billion in damage. Petascale simulations are needed in California and elsewhere to provide society with a better understanding of the rupture and wave dynamics of the largest earthquakes at the shaking frequencies required to engineer safe structures. As heterogeneous supercomputing infrastructures become more common, numerical developments in earthquake system research are particularly challenged by the dependence on accelerator elements to enable "the Big One" simulations with higher frequency and finer resolution. Reducing time to solution and power consumption are two primary focus areas today for the enabling technology of fault rupture dynamics and seismic wave propagation in realistic 3D models of the crust's heterogeneous structure. This dissertation presents scalable parallel programming techniques for high performance seismic simulation running on petascale heterogeneous supercomputers. A real-world earthquake simulation code, AWP-ODC, one of the most advanced earthquake codes to date, was chosen as the base code in this research, and the testbed is based on Titan at Oak Ridge National Laboratory, the world's largest heterogeneous supercomputer. The research work is primarily related to architecture study, computation performance tuning and software system scalability. An earthquake simulation workflow has also been developed to support efficient production sets of simulations. The highlights of the technical development are an aggressive performance optimization focusing on data locality and a notable data communication model that hides the data communication latency. This development results in optimal computation efficiency and throughput for the 13-point stencil code on heterogeneous systems, which can be extended to general high-order stencil codes. Starting from scratch, the hybrid CPU/GPU version of AWP

  10. Adapting algorithms to massively parallel hardware

    CERN Document Server

    Sioulas, Panagiotis

    2016-01-01

    In the recent years, the trend in computing has shifted from delivering processors with faster clock speeds to increasing the number of cores per processor. This marks a paradigm shift towards parallel programming in which applications are programmed to exploit the power provided by multi-cores. Usually there is gain in terms of the time-to-solution and the memory footprint. Specifically, this trend has sparked an interest towards massively parallel systems that can provide a large number of processors, and possibly computing nodes, as in the GPUs and MPPAs (Massively Parallel Processor Arrays). In this project, the focus was on two distinct computing problems: k-d tree searches and track seeding cellular automata. The goal was to adapt the algorithms to parallel systems and evaluate their performance in different cases.

  11. Computer-Aided Parallelizer and Optimizer

    Science.gov (United States)

    Jin, Haoqiang

    2011-01-01

    The Computer-Aided Parallelizer and Optimizer (CAPO) automates the insertion of compiler directives to facilitate parallel processing on Shared Memory Parallel (SMP) machines. While CAPO currently is integrated seamlessly into CAPTools (developed at the University of Greenwich, now marketed as ParaWise), CAPO was independently developed at Ames Research Center as one of the components for the Legacy Code Modernization (LCM) project. The current version takes serial FORTRAN programs, performs interprocedural data dependence analysis, and generates OpenMP directives. Due to the widely supported OpenMP standard, the generated OpenMP codes have the potential to run on a wide range of SMP machines. CAPO relies on accurate interprocedural data dependence information currently provided by CAPTools. Compiler directives are generated through identification of parallel loops in the outermost level, construction of parallel regions around parallel loops and optimization of parallel regions, and insertion of directives with automatic identification of private, reduction, induction, and shared variables. Attempts also have been made to identify potential pipeline parallelism (implemented with point-to-point synchronization). Although directives are generated automatically, user interaction with the tool is still important for producing good parallel codes. A comprehensive graphical user interface is included for users to interact with the parallelization process.
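
    The private, reduction and shared variable classes that CAPO identifies correspond to OpenMP clauses like those below. CAPO itself emits Fortran; this C++ fragment is our own minimal illustration of the kind of directive such a tool inserts once dependence analysis proves the loop parallel.

        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 1000;
            std::vector<double> a(n, 1.0), b(n, 2.0);
            double sum = 0.0;
            // 'i' and 't' are private, 'sum' is a reduction, 'a'/'b' are shared.
            #pragma omp parallel for default(shared) reduction(+:sum)
            for (int i = 0; i < n; ++i) {
                double t = a[i] * b[i];   // private by virtue of block scope
                sum += t;
            }
            printf("sum = %f\n", sum);    // 2000.000000 for any thread count
        }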

  12. The Effect of Programmed Physical Exercise to Attention and Working Memory Score in Medical Students

    Directory of Open Access Journals (Sweden)

    Kevin Fachri Muhammad

    2015-06-01

    Background: Attention and working memory are two cognitive domains crucial for activities of daily living. Physical exercise increases the levels of BDNF, IGF-1, and VEGF, which contribute to attention and working memory processes. This study was conducted to analyze the improvement of attention and working memory after the programmed physical exercise of Pendidikan Dasar XXI Atlas Medical Pioneer (Pendas XXI AMP). Methods: An analytic observational study was conducted on 47 students from the Faculty of Medicine, Universitas Padjadjaran, during September-November 2012. Attention was assessed using the digit span backward test, stroop test, visual search task, and trail making test. Working memory was assessed using the digit span forward test and digit symbol test. Assessment was done on the 11th and 19th week of Pendas XXI AMP. Data distribution was first tested for normality, and then analyzed using the dependent t-test and the Wilcoxon test. Results: Significant improvement was noted for attention in males based on working time for the stroop test (26.50±5.66 to 22.03±3.78 seconds), working memory in males based on digit symbol test score (43.96±6.14 to 53.36±5.26 points), attention in females based on reaction time of the visual search task for target absent (0.92±0.07 to 0.87±0.07 seconds), and working memory in females based on digit span forward score (5.42±1.30 to 6.63±1.07 points) and digit symbol test score (42.47±5.95 to 53.84±5.33 points). Conclusions: Exercise in Pendas XXI AMP improves attention and working memory for college students in the Faculty of Medicine, Universitas Padjadjaran.

  13. Libraries and Development Environments for Monte Carlo Simulations of Lattice Gauge Theories on Parallel Computers

    Science.gov (United States)

    Decker, K. M.; Jayewardena, C.; Rehmann, R.

    We describe the library lgtlib, and lgttool, the corresponding development environment for Monte Carlo simulations of lattice gauge theory on multiprocessor vector computers with shared memory. We explain why distributed memory parallel processor (DMPP) architectures are particularly appealing for compute-intensive scientific applications, and introduce the design of a general application and program development environment system for scientific applications on DMPP architectures.

  14. Parallel reduction to condensed forms for symmetric eigenvalue problems using aggregated fine-grained and memory-aware kernels

    KAUST Repository

    Haidar, Azzam

    2011-01-01

    This paper introduces a novel implementation in reducing a symmetric dense matrix to tridiagonal form, which is the preprocessing step toward solving symmetric eigenvalue problems. Based on tile algorithms, the reduction follows a two-stage approach, where the tile matrix is first reduced to symmetric band form prior to the final condensed structure. The challenging trade-off between algorithmic performance and task granularity has been tackled through a grouping technique, which consists of aggregating fine-grained and memory-aware computational tasks during both stages, while sustaining the application's overall high performance. A dynamic runtime environment system then schedules the different tasks in an out-of-order fashion. The performance for the tridiagonal reduction reported in this paper is unprecedented. Our implementation results in up to 50-fold and 12-fold improvement (130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000×24000. Copyright 2011 ACM.

  15. Performance Analysis of Parallel Programs Based on the CC-NUMA System Simulator

    Institute of Scientific and Technical Information of China (English)

    陈渝; 庞立会; 杨学军; 陈福接

    2001-01-01

    Targeting the characteristics of CC-NUMA parallel systems, this paper describes the design and implementation of a simulator, AMY, which runs under the Linux operating system on common x86 PCs. AMY adopts several optimization techniques, enabling it to measure the time overhead of parallel programs and the various parameters of a CC-NUMA parallel system quite accurately, while featuring fast execution, high precision and low memory overhead. Based on the simulated execution of several typical parallel benchmarks in the AMY environment, the paper presents the statistical simulation results, analyzes the execution behavior and overhead of the benchmarks, and concludes with useful guidelines for optimizing the performance of parallel programs on CC-NUMA parallel systems.

  16. Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

    KAUST Repository

    Belli, Roberto

    2015-05-01

    Remote Memory Access (RMA) programming enables direct access to low-level hardware features to achieve high performance for distributed-memory programs. However, the design of RMA programming schemes focuses on memory access and less on synchronization. For example, in contemporary RMA programming systems, the widely used producer-consumer pattern can only be implemented inefficiently, incurring the overhead of an additional round-trip message. We propose Notified Access, a scheme where the target process of an access can receive a completion notification. This scheme enables direct and efficient synchronization with a minimum number of messages. We implement our scheme in an open source MPI-3 RMA library and demonstrate lower overheads (two cache misses) than other point-to-point synchronization mechanisms for each notification. We also evaluate our implementation on three real-world benchmarks: a stencil computation, a tree computation, and a Cholesky factorization implemented with tasks. Our scheme always performs better than traditional message passing and other existing RMA synchronization schemes, providing up to 50% speedup on small messages. Our analysis shows that Notified Access is a valuable primitive for any RMA system. Furthermore, we provide guidance for the design of low-level network interfaces to support Notified Access efficiently.
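
    The inefficiency the authors target shows up in the baseline pattern below, sketched with standard MPI-3 RMA calls (our illustration, requiring at least two ranks): the producer's put must be followed by a separate notification message, the extra round trip that Notified Access folds into the access itself.

        #include <mpi.h>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            const int n = 1024;
            double* buf;                       // window memory on every rank
            MPI_Win win;
            MPI_Win_allocate(n * sizeof(double), sizeof(double),
                             MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);

            if (rank == 0) {                   // producer
                std::vector<double> data(n, 3.14);
                MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
                MPI_Put(data.data(), n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
                MPI_Win_unlock(1, win);        // completes the put at rank 1
                int token = 1;                 // separate notification: the cost
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {            // consumer
                int token;
                MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                // buf[0..n) is now valid; Notified Access delivers the same
                // guarantee as part of the put itself, saving the message.
            }
            MPI_Win_free(&win);
            MPI_Finalize();
        }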

  17. Development of the ubiquitous spaced retrieval-based memory advancement and rehabilitation training program.

    Science.gov (United States)

    Han, Ji Won; Oh, Kyusoo; Yoo, Sooyoung; Kim, Eunhye; Ahn, Ki-Hwan; Son, Yeon-Joo; Kim, Tae Hui; Chi, Yeon Kyung; Kim, Ki Woong

    2014-01-01

    The Ubiquitous Spaced Retrieval-based Memory Advancement and Rehabilitation Training (USMART) program was developed by transforming the spaced retrieval-based memory training which consisted of 24 face-to-face sessions into a self-administered program with an iPAD app. The objective of this study was to evaluate the feasibility and efficacy of USMART in elderly subjects with mild cognitive impairment (MCI). Feasibility was evaluated by checking the satisfaction of the participants with a 5-point Likert scale. The efficacy of the program on cognitive functions was evaluated by the Korean version of the Consortium to Establish a Registry for Alzheimer's Disease Neuropsychological Assessment Battery before and after USMART. Among the 10 participants, 7 completed both pre- and post-USMART assessments. The overall satisfaction score was 8.0±1.0 out of 10. The mean Word List Memory Test (WLMT) scores significantly increased after USMART training after adjusting for age, educational levels, baseline Mini-Mental Status Examination scores, and the number of training sessions (pre-USMART, 16.0±4.1; post-USMART, 17.9±4.5; p=0.014, RM-ANOVA). The magnitude of the improvements in the WLMT scores significantly correlated with the number of training sessions during 4 weeks (r=0.793; p=0.033). USMART was effective in improving memory and was well tolerated by most participants with MCI, suggesting that it may be a convenient and cost-effective alternative for the cognitive rehabilitation of elderly subjects with cognitive impairments. Further studies with large numbers of participants are necessary to examine the relationship between the number of training sessions and the improvements in memory function.

  18. Parallel R

    CERN Document Server

    McCallum, Ethan

    2011-01-01

    It's tough to argue with R as a high-quality, cross-platform, open source statistical software product--unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.

  19. Research on Memory Behavior of Java Programs

    Institute of Scientific and Technical Information of China (English)

    章婧; 卢凯; 周旭

    2011-01-01

    Research on the memory behavior of Java programs is the primary step toward optimizing the energy consumption of the storage management system on the Java platform. This paper measures a large amount of memory-behavior data from typical Java applications. Through analysis of the data, we find clear regularities in the memory allocation patterns and memory usage traces of Java applications, including staged behavior, periodicity, and stationarity. These regularities are of significant importance for garbage collection in the Java virtual machine and for the energy-consumption optimization of Java programs.

  20. 64 kbit Ferroelectric-Gate-Transistor-Integrated NAND Flash Memory with 7.5 V Program and Long Data Retention

    Science.gov (United States)

    Zhang, Xizhen; Takahashi, Mitsue; Takeuchi, Ken; Sakai, Shigeki

    2012-04-01

    A 64 kbit (kb) one-transistor-type ferroelectric memory array was fabricated and characterized. Pt/SrBi2Ta2O9/Hf-Al-O/Si ferroelectric-gate field-effect transistors (FeFETs) were used as the memory cells. The gate length and width were 5 and 5 µm, respectively. The array design was based on NAND flash memory organized as 8 word lines × 32 blocks × 256 bit lines. Erase, program, and nondestructive-read operations were demonstrated in every block. Threshold-voltage (Vth) reading of all the 64 kb memory cells showed a clear separation between their all-erased and all-programmed states. A checkerboard pattern was also programmed in a block and the two distinguishable Vth distributions were read out. The Vth retention of a block of 2 kb memory cells showed no significant degradation after two days.

  1. Static Data Race Detection for X10 Parallel Programs

    Institute of Scientific and Technical Information of China (English)

    王旭; 陈雨亭

    2012-01-01

    A multi-threaded program can contain a data race when two or more threads access the same memory location under no ordering constraints and at least one access is a write operation. The existence of data races can lead to many kinds of harmful program behaviors, including determinism violations, corrupted memory, and so on. This paper proposes a new algorithm for the static detection of data races in X10 parallel programs, which consists of four steps: computing pairs of source accesses, computing pairs of reachable accesses, computing pairs of clock-synchronized accesses, and escape analysis of the access pairs. The essential idea of this approach is to construct the call graph of the program on the basis of the WALA framework and then to compute the pairs of source accesses for detecting potential data races, i.e., the unordered pairs of memory accesses at which races may occur. Experimental results show that the algorithm performs well and can find and report potential data races in a cost-effective manner, without significantly increasing the overall running time of the X10 parallel program.
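
    The defining pattern in the first sentence, rendered here with C++ threads for concreteness (in X10 the unordered accesses would come from async activities): two accesses to the same location with no ordering constraint, one of them a write.

        #include <cstdio>
        #include <thread>

        int counter = 0;   // shared memory location

        int main() {
            std::thread writer([] { counter = 42; });              // write
            std::thread reader([] { printf("%d\n", counter); });   // read
            // No ordering constraint exists between the two accesses,
            // and one is a write: this is exactly a data race.
            writer.join();
            reader.join();
        }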

  2. Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

    Directory of Open Access Journals (Sweden)

    Robert Gerstenberger

    2014-01-01

    Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communication, despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing the highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth and message rate. We also demonstrate application performance improvements with comparable programming complexity.
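
    A minimal passive-target access of the kind the MPI-3.0 interface standardizes, sketched with standard calls (our example, not the paper's protocol code; run with two or more processes):

        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            double* base;
            MPI_Win win;
            MPI_Win_allocate(sizeof(double), sizeof(double),
                             MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
            *base = 100.0 + rank;              // each rank exposes one double
            MPI_Barrier(MPI_COMM_WORLD);       // all windows initialized

            MPI_Win_lock_all(0, win);          // one epoch, no matching receives
            double remote = 0.0;
            int target = (rank + 1) % size;    // read the right neighbour
            MPI_Get(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush_local(target, win);  // 'remote' now usable locally
            printf("rank %d read %.1f from rank %d\n", rank, remote, target);
            MPI_Win_unlock_all(win);

            MPI_Win_free(&win);
            MPI_Finalize();
        }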

  3. TBB Task-oriented Mixed-parallel Programming Model for Multi-Core Cluster

    Institute of Scientific and Technical Information of China (English)

    顾慧; 郑晓薇; 张建强; 吴华平

    2011-01-01

    A hybrid parallel programming model suited to multi-core clusters is constructed. The model combines two modes: shared-memory, task-oriented TBB programming and message-passing MPI programming. Combining the advantages of both, it achieves two-level parallelism: processes are mapped to compute nodes, and the threads within each process are mapped to processor cores, so that each core of every CPU is used efficiently. Compared with program performance under a single programming model, an algorithm using this hybrid parallel programming model not only reduces program execution time and achieves better speedup and execution efficiency, but also markedly improves cluster performance.
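
    A minimal sketch of the two-level structure described above, assuming one MPI process per node and TBB tasks across that node's cores; the data sizes and the reduction computed are invented for illustration.

        #include <mpi.h>
        #include <tbb/blocked_range.h>
        #include <tbb/parallel_reduce.h>
        #include <cstdio>
        #include <functional>
        #include <vector>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const std::size_t n = 1 << 20;        // assumes n divisible by size
            const std::size_t chunk = n / size;   // process level: block per rank
            std::vector<double> local(chunk, 1.0);

            // Thread level: TBB's work stealing spreads the block over cores.
            double local_sum = tbb::parallel_reduce(
                tbb::blocked_range<std::size_t>(0, chunk), 0.0,
                [&](const tbb::blocked_range<std::size_t>& r, double s) {
                    for (std::size_t i = r.begin(); i != r.end(); ++i)
                        s += local[i];
                    return s;
                },
                std::plus<double>());

            double global_sum = 0.0;
            MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0) printf("sum = %.0f\n", global_sum);
            MPI_Finalize();
        }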

  4. Automatic generation of scheduling and communication code in real-time parallel programs

    NARCIS (Netherlands)

    Bakkers, André; Sunter, Johan; Ploeg, Evert

    1995-01-01

    Inter-process communication and scheduling are notorious problem areas in the design of real-time systems. Using CASE tools, the system design phase will in general result in a system description in the form of parallel processes. Manual allocation of these processes to processors may result in erro

  5. A survey of parallel execution strategies for transitive closure and logic programs

    NARCIS (Netherlands)

    Cacace, F.; Ceri, S.; Houtsma, M.A.W.

    1993-01-01

    An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In

  6. Benefits of a Classroom Based Instrumental Music Program on Verbal Memory of Primary School Children: A Longitudinal Study

    Science.gov (United States)

    Rickard, Nikki S.; Vasquez, Jorge T.; Murphy, Fintan; Gill, Anneliese; Toukhsati, Samia R.

    2010-01-01

    Previous research has demonstrated a benefit of music training on a number of cognitive functions including verbal memory performance. The impact of school-based music programs on memory processes is however relatively unknown. The current study explored the effect of increasing frequency and intensity of classroom-based instrumental training…

  7. Understanding Notional Machines through Traditional Teaching with Conceptual Contraposition and Program Memory Tracing

    Directory of Open Access Journals (Sweden)

    Jeisson Hidalgo-Céspedes

    2016-08-01

    A correct understanding of how computers run code is mandatory in order to effectively learn to program. Lectures have historically been used in programming courses to teach how computers execute code, and students are assessed through traditional evaluation methods, such as exams. Constructivist learning theory objects to students' passiveness during lessons and to traditional quantitative methods for evaluating a complex cognitive process such as understanding. Constructivism proposes complementary techniques, such as conceptual contraposition and colloquies. We enriched the lectures of a "Programming II" (CS2) course by combining conceptual contraposition with program memory tracing, then we evaluated students' understanding of programming concepts through colloquies. Results revealed that these techniques applied to the lectures are insufficient to help students develop satisfactory mental models of the C++ notional machine, and that colloquies behaved as the most comprehensive traditional evaluations conducted in the course.

  8. Parallel implementation and evaluation of motion estimation system algorithms on a distributed memory multiprocessor using knowledge based mappings

    Science.gov (United States)

    Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.

    1989-01-01

    Several techniques for static and dynamic load balancing in vision systems are presented. These techniques are novel in that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same or have similar computational characteristics. These techniques are evaluated by applying them to a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.

  9. Mathematical Programming Method Based on Chaos Anti-Control for the Solution of Forward Displacement of Parallel Robot Mechanisms

    Directory of Open Access Journals (Sweden)

    Youxin Luo

    2013-01-01

    Strong coupling makes the pose of the moving platform in parallel robots possible, but it consequently makes the forward displacement very difficult to obtain. Different methods for establishing the forward displacement can obtain different numbers of variables and different solving speeds with nonlinear equations. The nonlinear equations with nine variables for the forward displacement of the general 6-6 type parallel mechanism were created using the rotation transformation matrix R, translation vector P and the constraint conditions of the rod length. Given the problems of there being only one solution and sometimes no convergence when solving nonlinear equations with the Newton method and the quasi-Newton method, the Euler equation for free rotation in a rigid body was applied to a chaotic system by using chaos anti-control, and chaotic sequences were produced. Combining the characteristics of the chaotic sequence with the mathematical programming method, a new mathematical programming method based on chaos anti-control is put forward, with the aim of solving all real solutions of the nonlinear equations for the forward displacement of the general 6-6 type parallel mechanism. The numerical example shows that the new method has positive characteristics: it runs in the initial value range, it converges quickly, it can find all the possible real solutions, and comparison with other methods proves its correctness and validity.

  10. Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

    Science.gov (United States)

    Bellerby, Tim

    2015-04-01

    PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks ≤ number of processors
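
    The processor-division step described above can be pictured with plain MPI: a sketch of ours (not the PM runtime) in which a parallel statement generating ntasks tasks splits the parent communicator so that each task owns a disjoint subset of the processors. The task count and assignment rule are invented.

        #include <mpi.h>
        #include <cstdio>

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int ntasks = 2;              // tasks created by the statement
            int task = rank % ntasks;          // this processor's task
            MPI_Comm task_comm;                // processors divided among tasks
            MPI_Comm_split(MPI_COMM_WORLD, task, rank, &task_comm);

            int trank, tsize;
            MPI_Comm_rank(task_comm, &trank);
            MPI_Comm_size(task_comm, &tsize);
            printf("world %d/%d -> task %d, local %d/%d\n",
                   rank, size, task, trank, tsize);

            MPI_Comm_free(&task_comm);
            MPI_Finalize();
        }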

  11. Parallel scheduling algorithms

    Energy Technology Data Exchange (ETDEWEB)

    Dekel, E.; Sahni, S.

    1983-01-01

    Parallel algorithms are given for scheduling problems such as scheduling to minimize the number of tardy jobs, job sequencing with deadlines, scheduling to minimize earliness and tardiness penalties, channel assignment, and minimizing the mean finish time. The shared memory model of parallel computers is used to obtain fast algorithms. 26 references.
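
    As a concrete instance of the first problem named above, the classic sequential rule for minimizing the number of tardy jobs (Moore's algorithm) is sketched below; the paper's contribution is fast shared-memory parallel algorithms for such problems, which this baseline merely illustrates.

        #include <algorithm>
        #include <cstdio>
        #include <queue>
        #include <vector>

        struct Job { int p, d; };   // processing time, due date

        int main() {
            std::vector<Job> jobs = {{3,5},{2,6},{4,8},{2,9},{5,12}};
            std::sort(jobs.begin(), jobs.end(),
                      [](const Job& a, const Job& b) { return a.d < b.d; });
            std::priority_queue<int> longest;  // times of jobs kept on time
            long t = 0;                         // completion time of kept set
            for (const Job& j : jobs) {
                t += j.p;
                longest.push(j.p);
                if (t > j.d) {                  // due date missed:
                    t -= longest.top();         // drop the longest job so far
                    longest.pop();
                }
            }
            printf("on-time jobs: %zu of %zu\n", longest.size(), jobs.size());
        }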

  12. Abstract Level Parallelization of Finite Difference Methods

    Directory of Open Access Journals (Sweden)

    Edwin Vollebregt

    1997-01-01

    A formalism is proposed for describing finite difference calculations in an abstract way. The formalism consists of index sets and stencils, for characterizing the structure of sets of data items and the interactions between data items ("neighbouring relations"). The formalism provides a means for lifting programming to a more abstract level. This simplifies the tasks of performance analysis and verification of correctness, and opens the way for automatic code generation. The notation is particularly useful in parallelization, for the systematic construction of parallel programs in a process/channel programming paradigm (e.g., message passing). This is important because message passing, unfortunately, still is the only approach that leads to acceptable performance for many more unstructured or irregular problems on parallel computers that have non-uniform memory access times. It will be shown that the use of index sets and stencils greatly simplifies the determination of which data must be exchanged between different computing processes.
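
    A toy rendering of the index-set/stencil idea, with names of our own choosing rather than the paper's notation: a stencil is a set of offsets, and applying it to the locally owned index block yields exactly the halo indices that must be exchanged with neighbouring processes.

        #include <cstdio>
        #include <set>
        #include <utility>
        #include <vector>

        int main() {
            // 5-point stencil on a 2-D grid, expressed as offsets.
            const std::vector<std::pair<int,int>> stencil =
                {{0,0}, {1,0}, {-1,0}, {0,1}, {0,-1}};

            // This process owns rows [4,8) of a 12 x 8 grid.
            const int lo = 4, hi = 8, nx = 12, ny = 8;

            std::set<std::pair<int,int>> halo;  // needed but not owned
            for (int i = lo; i < hi; ++i)
                for (int j = 0; j < ny; ++j)
                    for (const auto& off : stencil) {
                        int ii = i + off.first, jj = j + off.second;
                        bool owned  = ii >= lo && ii < hi && jj >= 0 && jj < ny;
                        bool ongrid = ii >= 0 && ii < nx && jj >= 0 && jj < ny;
                        if (!owned && ongrid) halo.insert({ii, jj});
                    }
            printf("halo points: %zu\n", halo.size());  // rows 3 and 8 -> 16
        }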

  13. A parallel buffer tree

    DEFF Research Database (Denmark)

    Sitchinava, Nodar; Zeh, Norbert

    2012-01-01

    We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries...

  14. Parallel biocomputing

    Directory of Open Access Journals (Sweden)

    Witte John S

    2011-03-01

    Background: With the advent of high throughput genomics and high-resolution imaging techniques, there is a growing necessity in biology and medicine for parallel computing, and with the low cost of computing, it is now cost-effective for even small labs or individuals to build their own personal computation cluster. Methods: Here we briefly describe how to use commodity hardware to build a low-cost, high-performance compute cluster, and provide an in-depth example and sample code for parallel execution of R jobs using MOSIX, a mature extension of the Linux kernel for parallel computing. A similar process can be used with other cluster platform software. Results: As a statistical genetics example, we use our cluster to run a simulated eQTL experiment. Because eQTL is computationally intensive, and is conceptually easy to parallelize, like many statistics/genetics applications, parallel execution with MOSIX gives a linear speedup in analysis time with little additional effort. Conclusions: We have used MOSIX to run a wide variety of software programs in parallel with good results. The limitations and benefits of using MOSIX are discussed and compared to other platforms.

  15. Reverse Programmed SONOS Memory Technique for 0.18μm Embedded Utilization

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    A 4 Mb embedded silicon-oxide-nitride-oxide-silicon (SONOS) memory was developed with a 0.18 μm CMOS logic compatible technology. A reverse programming array architecture was proposed to reduce the chip area, enhance the operating window, and increase the read speed. The charge distribution was analyzed to optimize the programming and erase conditions considering both the operating speed and the endurance performance. The final test chip has a good endurance of 10⁵ cycles and a data retention time of at least 10 years.

  16. Strength analysis of parallel robot components in PLM Siemens NX 8.5 program

    Science.gov (United States)

    Ociepka, P.; Herbus, K.

    2015-11-01

    This article presents a series of numerical analyses carried out to identify the states of stress arising in the elements of a mechanism during its operation. The object of the research was a parallel robot, which is the basis for a prototype driving simulator. The Motion Simulation module and the RecurDyn solver were used to conduct the dynamic analysis. In this module, the joints occurring in the parallel robot mechanism were created. Dynamic analyses were then performed to determine the maximal forces applied to the analyzed elements. The platform motion was also analyzed while simulating a collision of a car with a wall. In the next step, based on the results obtained from the dynamic analysis, strength analyses were performed in the Advanced Simulation module, using the NX Nastran solver for the calculations.

  17. Structured Parallel Programming: patterns for efficient computation: Michael McCool, Arch D. Robison, James Reinders, Morgan Kaufmann-Elsevier 2012

    OpenAIRE

    De Giusti, Armando Eduardo

    2015-01-01

    In this book the authors, who are parallel computing experts and industry insiders, describe how to design and implement maintainable and efficient parallel algorithms using a pattern-based approach. They present both theory and practice, and give some specific examples using multiple programming models.

  18. Programming Environment for a High-Performance Parallel Supercomputer with Intelligent Communication

    OpenAIRE

    A. Gunzinger; Bäumle, B.; Frey, M.; Klebl, M.; Kocheisen, M.; Kohler, P.; Morel, R.; Müller, U.; Rosenthal, M.

    1996-01-01

    At the Electronics Laboratory of the Swiss Federal Institute of Technology (ETH) in Zürich, the high-performance parallel supercomputer MUSIC (MUlti processor System with Intelligent Communication) has been developed. As applications like neural network simulation and molecular dynamics show, the Electronics Laboratory supercomputer is absolutely on par with those of conventional supercomputers, but electric power requirements are reduced by a factor of 1,000, weight is reduced by a factor of...

  19. Effects of a physical fitness program on memory and blood viscosity in sedentary elderly men

    Directory of Open Access Journals (Sweden)

    H.K. Antunes

    2015-01-01

    The aim of this study was to investigate the effects of a 6-month exercise program on cognitive function and blood viscosity in sedentary elderly men. Forty-six healthy inactive men, aged 60–75 years, were randomly distributed into a control group (n=23) and an experimental group (n=23). Participants underwent blood analysis and physical and memory evaluation, before and after the 6-month program of physical exercise. The control group was instructed not to alter its everyday activities; the experimental group took part in the fitness program. The program was conducted using a cycle ergometer, 3 times per week on alternate days, with intensity and volume individualized at ventilatory threshold 1. Sessions were continuous and the maximum duration was 60 min each. There was significant improvement in memory (21%; P<0.05), decreased blood viscosity (−19%; P<0.05), and higher aerobic capacity (48%; P<0.05) among participants in the experimental group compared with the control group. These data suggest that taking part in an aerobic physical fitness program at an intensity corresponding to ventilatory threshold 1 may be considered a nonmedication alternative to improve physical and cognitive function.

  1. A database for on-line event analysis on a distributed memory machine

    CERN Document Server

    Argante, E; Van der Stok, P D V; Willers, Ian Malcolm

    1995-01-01

    Parallel in-memory databases can enhance the structuring and parallelization of programs used in High Energy Physics (HEP). Efficient database access routines are used as communication primitives which hide the communication topology, in contrast to more explicit communication libraries like PVM or MPI. A parallel in-memory database, called SPIDER, has been implemented on a 32 node Meiko CS-2 distributed memory machine. The SPIDER primitives generate a lower overhead than the one generated by PVM or MPI. The event reconstruction program, CPREAD of the CPLEAR experiment, has been used as a test case. Performance measurements show that SPIDER can handle the event rate generated by CPLEAR.

  2. Implicit Memory in Monkeys: Development of a Delay Eyeblink Conditioning System with Parallel Electromyographic and High-Speed Video Measurements.

    Directory of Open Access Journals (Sweden)

    Yasushi Kishimoto

    Delay eyeblink conditioning, a cerebellum-dependent learning paradigm, has been applied to various mammalian species but not yet to monkeys. We therefore developed an accurate measuring system that we believe is the first system suitable for delay eyeblink conditioning in a monkey species (Macaca mulatta). Monkey eyeblinking was simultaneously monitored by orbicularis oculi electromyographic (OO-EMG) measurements and a high-speed camera-based tracking system built around a 1-kHz CMOS image sensor. A 1-kHz tone was the conditioned stimulus (CS), while an air puff (0.02 MPa) was the unconditioned stimulus. EMG analysis showed that the monkeys exhibited a conditioned response (CR) incidence of more than 60% of trials during the 5-day acquisition phase and an extinguished CR during the 2-day extinction phase. The camera system yielded similar results. Hence, we conclude that both methods are effective in evaluating monkey eyeblink conditioning. This system, incorporating two different measuring principles, enabled us to elucidate the relationship between the actual presence of eyelid closure and OO-EMG activity. An interesting finding permitted by the new system was that the monkeys frequently exhibited obvious CRs even when they produced visible facial signs of drowsiness or microsleep. Indeed, the probability of observing a CR in a given trial was not influenced by whether the monkeys closed their eyelids just before CS onset, suggesting that this memory could be expressed independently of wakefulness. This work presents a novel system for cognitive assessment in monkeys that will be useful for elucidating the neural mechanisms of implicit learning in nonhuman primates.

  3. Using a multi-port architecture of neural-net associative memory based on the equivalency paradigm for parallel cluster image analysis and self-learning

    Science.gov (United States)

    Krasilenko, Vladimir G.; Lazarev, Alexander A.; Grabovlyak, Sveta K.; Nikitovich, Diana V.

    2013-01-01

    We consider equivalency models, including matrix-matrix and matrix-tensor models with dual adaptive-weighted correlation, and multi-port neural-net auto-associative and hetero-associative memory (MP NN AAM and HAP), which embody the equivalency paradigm that is the theoretical basis of our work. We give a brief overview of possible implementations of the MP NN AAM and of their architectures, proposed and investigated by us earlier. The main base unit of such architectures is a matrix-matrix or matrix-tensor equivalentor. We show that MP NN AAM based on the equivalency paradigm and on optoelectronic architectures with space-time integration and parallel-serial 2D image processing have advantages such as increased memory capacity (more than ten times the number of neurons!), high performance in different modes (10¹⁰-10¹² connections per second!) and the ability to process, store and associatively recognize highly correlated images. Next, we show that with minor modifications such MP NN AAM can be successfully used for high-performance parallel clustering processing of images. We show simulation results of using these modifications for clustering, together with learning models and algorithms for cluster analysis of specific images and their division into categories of the array. An example is shown of a cluster division of 32 images (40x32 pixels) of letters and graphics into 12 clusters, with simultaneous formation of the output-weighted space-allocated images for each cluster. We discuss algorithms for learning and self-learning in such structures, and their comparative evaluations based on Mathcad simulations are made. It is shown that, unlike traditional Kohonen self-organizing maps, the learning time in the proposed structures of the multi-port neuronet classifier/clusterizer (MP NN C) based on the equivalency paradigm decreases by orders of magnitude, owing to their multi-port nature, and can in some cases be just a few epochs. Estimates show that in the test clustering of 32 1280-element images into 12

  4. Improving memory in Parkinson's disease: a healthy brain ageing cognitive training program.

    Science.gov (United States)

    Naismith, Sharon L; Mowszowski, Loren; Diamond, Keri; Lewis, Simon J G

    2013-07-01

    This study aimed to evaluate the efficacy of a multifactorial 'healthy brain ageing cognitive training program' for Parkinson's disease. Using a single-blinded waitlist control design, 50 participants with Parkinson's disease were recruited from the Brain & Mind Research Institute, Sydney, Australia. The intervention encompassed both psychoeducation and cognitive training; each component lasted 1 hour. The 2-hour sessions were delivered in a group format, twice weekly over a 7-week period. Multifactorial psychoeducation was delivered by a range of health professionals. In addition to delivering cognitive strategies, it targeted depression, anxiety, sleep, vascular risk factors, diet, and exercise. Cognitive training was computer-based and was conducted by clinical neuropsychologists. The primary outcome was memory. Secondary outcomes included other aspects of cognition and knowledge pertaining to the psychoeducation material. Results demonstrated that cognitive training was associated with significant improvements in learning and memory corresponding to medium to large effect sizes. Treatment was also associated with medium effect size improvements in knowledge. Although the study was limited by the lack of randomized allocation to treatment and control groups, these findings suggest that a healthy brain ageing cognitive training program may be a viable tool to improve memory and/or slow cognitive decline in people with Parkinson's disease. It also appeared successful for increasing awareness of adaptive and/or compensatory cognitive strategies, as well as modifiable risk factors to optimize brain functioning.

  5. Fish oil supplementation and physical exercise program: distinct effects on different memory tasks.

    Science.gov (United States)

    Rachetti, A L F; Arida, R M; Patti, C L; Zanin, K A; Fernades-Santos, L; Frussa-Filho, R; Gomes da Silva, S; Scorza, F A; Cysneiros, R M

    2013-01-15

    Both fish oil supplementation and physical exercise are able to induce mental health benefits by improving cognitive performance and enhancing neuroplasticity and protection against neurological lesions. The aim of the present study was to investigate the cognitive effects in rats of: (1) daily, prolonged fish oil supplementation (85 mg/kg/day) initiated in the prenatal period and continued to midlife (300 days old); (2) moderate physical exercise on a treadmill initiated in adolescence and continued to midlife; and (3) the association of fish oil supplementation and the moderate physical exercise protocol over the same period. Animals were submitted to habituation in the open field, object recognition, and the plus-maze discriminative avoidance tasks. Our results demonstrated that daily, prolonged fish oil supplementation can facilitate the persistence of long-term habituation and recognition memories without, however, affecting discriminative avoidance memory. Conversely, although the physical exercise program exerted no effects on habituation or object recognition, it was able to potentiate the persistence of discriminative avoidance memory. Such promnestic effects (induced by both fish oil supplementation and physical exercise) were not accompanied by alterations in emotionality or locomotor activity. Our findings suggest that fish oil supplementation initiated in the prenatal period and continued to midlife, and a physical exercise program applied throughout life, distinctly induced better cognitive performance.

  6. A model of long-term memory storage in the cerebellar cortex: a possible role for plasticity at parallel fiber synapses onto stellate/basket interneurons.

    Science.gov (United States)

    Kenyon, G T

    1997-12-09

    By evoking changes in climbing fiber activity, movement errors are thought to modify synapses from parallel fibers onto Purkinje cells (pf*Pkj) so as to improve subsequent motor performance. Theoretical arguments suggest there is an intrinsic tradeoff, however, between motor adaptation and long-term storage. Assuming a baseline rate of motor errors is always present, then repeated performance of any learned movement will generate a series of climbing fiber-mediated corrections. By reshuffling the synaptic weights responsible for any given movement, such corrections will degrade the memories for other learned movements stored in overlapping sets of synapses. The present paper shows that long-term storage can be accomplished by a second site of plasticity at synapses from parallel fibers onto stellate/basket interneurons (pf*St/Bk). Plasticity at pf*St/Bk synapses can be insulated from ongoing fluctuations in climbing fiber activity by assuming that changes in pf*St/Bk synapses occur only after changes in pf*Pkj synapses have built up to a threshold level. Although climbing fiber-dependent plasticity at pf*Pkj synapses allows for the exploration of novel motor strategies in response to changing environmental conditions, plasticity at pf*St/Bk synapses transfers successful strategies to stable long-term storage. To quantify this hypothesis, both sites of plasticity are incorporated into a dynamical model of the cerebellar cortex and its interactions with the inferior olive. When used to simulate idealized motor conditioning trials, the model predicts that plasticity develops first at pf*Pkj synapses, but with additional training is transferred to pf*St/Bk synapses for long-term storage.
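
    The two-site scheme can be caricatured in a few lines: a fast, climbing-fiber-driven weight whose accumulated change is consolidated into a stable second site only once it crosses a threshold. The parameter values and update rule below are our assumptions, not the paper's equations.

        #include <cmath>
        #include <cstdio>

        int main() {
            double w_pkj  = 0.0;   // labile pf*Pkj weight (CF-driven)
            double w_stbk = 0.0;   // stable pf*St/Bk weight (long-term store)
            const double lr = 0.5, threshold = 0.25;

            const double cf_errors[] = {0.2, 0.2, 0.2, -0.1, 0.3, 0.2};
            for (double e : cf_errors) {
                w_pkj += lr * e;                  // fast adaptation site
                if (std::fabs(w_pkj) > threshold) {
                    w_stbk += w_pkj;              // transfer to stable site
                    w_pkj = 0.0;                  // labile site free to adapt
                }
                printf("pkj=%.2f  stbk=%.2f\n", w_pkj, w_stbk);
            }
        }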

  7. Primitive parallel operations for computational linear algebra

    Energy Technology Data Exchange (ETDEWEB)

    Panetta, J.

    1985-01-01

    This work is a small step in the direction of code portability over parallel and vector machines. The proposal consists of a style of programming and a set of parallel operators built over abstract data types. Objects and operators are directed to the Computational Linear Algebra area, although the principles of the proposal can be applied to any other area. A subset of the operators was implemented on a 64-processor, distributed memory MIMD machine, and the results are that computationally intensive operators achieve asymptotically optimal speed-ups, but data movement operators are inefficient, some even intrinsically sequential.

  8. A program for undergraduate research into the mechanisms of sensory coding and memory decay

    Energy Technology Data Exchange (ETDEWEB)

    Calin-Jageman, R J

    2010-09-28

    This is the final technical report for this DOE project, entitled "A program for undergraduate research into the mechanisms of sensory coding and memory decay". The report summarizes progress on the three research aims: 1) to identify physiological and genetic correlates of long-term habituation, 2) to understand mechanisms of olfactory coding, and 3) to foster a world-class undergraduate neuroscience program. Progress on the first aim has enabled comparison of learning-regulated transcripts across closely related learning paradigms and species, and results suggest that only a small core of transcripts serve truly general roles in long-term memory. Progress on the second aim has enabled testing of several mutant phenotypes for olfactory behaviors, and results show that responses are not fully consistent with the combinatorial coding hypothesis. Finally, 14 undergraduate students participated in this research, the neuroscience program attracted extramural funding, and we completed a successful summer program to enhance transitions for community-college students into 4-year colleges to pursue STEM fields.

  9. 76 FR 66309 - Pilot Program for Parallel Review of Medical Products; Correction

    Science.gov (United States)

    2011-10-26

    ... Federal Register of October 11, 2011 (76 FR 62808). The document announced a pilot program for sponsors of...-796-6579. SUPPLEMENTARY INFORMATION: In FR Doc. 2011-25907, appearing on page 62808 in the Federal... HUMAN SERVICES Centers for Medicare and Medicaid Services Food and Drug Administration Pilot Program...

  10. Asynchronous Adaptive Optimisation for Generic Data-Parallel Array Programming and Beyond

    NARCIS (Netherlands)

    Grelck, C.

    2011-01-01

    We present the concept of an adaptive compiler optimisation framework for the functional array programming language SaC, Single Assignment C. SaC advocates shape- and rank-generic programming with multidimensional arrays. A sophisticated, highly optimising compiler technology nonetheless achieves co

  11. Data Handover: Reconciling Message Passing and Shared Memory

    OpenAIRE

    Gustedt, Jens

    2004-01-01

    Data Handover (DHO) is a programming paradigm and interface that aims to handle data between parallel or distributed processes, mixing aspects of message passing and shared memory. It is designed to overcome the potential efficiency problems of both: (1) memory blowup and forced copies for message passing and (2) data consistency and latency problems for shared memory. Our approach attempts to be simple and easy to understand. Its content...

  12. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark; Bethel, E. Wes; Childs, Hank

    2010-03-20

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  13. Hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Science.gov (United States)

    Howison, M.; Bethel, E. W.; Childs, H.

    2011-10-01

    This work studies the performance and scalability characteristics of "hybrid" parallel programming and execution as applied to raycasting volume rendering - a staple visualization algorithm - on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today, as well as processors capable of running hundreds of concurrent threads (GPUs), we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.
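
    The shape of the hybrid design measured in these two studies can be sketched as follows: MPI ranks own disjoint image tiles while OpenMP threads share the rays within a tile. The image size, ray kernel and gathering step are placeholders of ours.

        #include <mpi.h>
        #include <vector>

        // Hypothetical per-ray kernel: composite samples along one ray.
        static float cast_ray(int x, int y) { return 0.001f * (x + y); }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int W = 512, H = 512;        // assumes H divisible by size
            const int rows = H / size;         // distributed level: tile/rank
            const int y0 = rank * rows;
            std::vector<float> tile(W * rows);

            // Shared-memory level: threads split the tile's scanlines.
            #pragma omp parallel for schedule(dynamic)
            for (int y = 0; y < rows; ++y)
                for (int x = 0; x < W; ++x)
                    tile[y * W + x] = cast_ray(x, y0 + y);

            // Final image gathered on rank 0 (compositing elided).
            std::vector<float> image(rank == 0 ? W * H : 0);
            MPI_Gather(tile.data(), W * rows, MPI_FLOAT,
                       image.data(), W * rows, MPI_FLOAT, 0, MPI_COMM_WORLD);
            MPI_Finalize();
        }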

  14. A Study on the Effect of Communication Performance on Message-Passing Parallel Programs: Methodology and Case Studies

    Science.gov (United States)

    Sarukkai, Sekhar R.; Yan, Jerry; Woodrow, Thomas (Technical Monitor)

    1994-01-01

    From a source-program perspective, the performance achieved on distributed/parallel systems is governed by the underlying message-passing library overhead and the network capabilities of the architecture. Studying the impact of changes in these features on the source program can have a significant influence on the development of next-generation system designs. In this paper we introduce a simple and robust tool that can be used for this purpose. This tool is based on event-driven simulation of programs that generates a new set of trace events, preserving causality and partial order, corresponding to the expected execution of the program in the simulated environment. Trace events can be visualized, and source-level profile information can be used to pinpoint the locations of the program that are most significantly affected by changing system parameters in the simulated environment. We present a number of examples from the NAS benchmark suite, executed on the Intel Paragon and iPSC/860, that are used to identify and expose performance bottlenecks under varying system parameters. Specific aspects of the system that significantly affect these benchmarks are presented and discussed.

  17. Parallelization and automatic data distribution for nuclear reactor simulations

    Energy Technology Data Exchange (ETDEWEB)

    Liebrock, L.M. [Liebrock-Hicks Research, Calumet, MI (United States)]

    1997-07-01

    Detailed attempts at realistic nuclear reactor simulations currently take many times real time to execute on high-performance workstations. Even the fastest sequential machine cannot run these simulations fast enough to ensure that the best corrective measure is used during a nuclear accident to prevent a minor malfunction from becoming a major catastrophe. Since sequential computers have nearly reached the speed-of-light barrier, these simulations will have to be run in parallel to make significant improvements in speed. In physical reactor plants, parallelism abounds. Fluids flow, controls change, and reactions occur in parallel, with only adjacent components directly affecting each other. These do not occur in the sequentialized manner, with global instantaneous effects, that is often used in simulators. Development of parallel algorithms that more closely approximate the real-world operation of a reactor may, in addition to speeding up the simulations, actually improve the accuracy and reliability of the predictions generated. Three types of parallel architecture (shared memory machines, distributed memory multicomputers, and distributed networks) are briefly reviewed as targets for parallelization of nuclear reactor simulation. Various parallelization models (loop-based model, shared memory model, functional model, data parallel model, and a combined functional and data parallel model) are discussed along with their advantages and disadvantages for nuclear reactor simulation. A variety of tools are introduced for each of the models. Emphasis is placed on the data parallel model as the primary focus for two-phase flow simulation. Tools to support data parallel programming for multiple component applications and special parallelization considerations are also discussed.

  18. Neurite, a finite difference large scale parallel program for the simulation of electrical signal propagation in neurites under mechanical loading.

    Directory of Open Access Journals (Sweden)

    Julián A García-Grajales

    Full Text Available With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally characterized only by purely mechanistic criteria: functions of quantities such as stress, strain, or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has rarely been explored. In particular, a quantitative, mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computationally expensive calculations that increase in complexity as the network of simulated cells grows. The solvers implemented in Neurite (explicit and implicit) were therefore parallelized using graphics processing units in order to reduce the simulation costs of large-scale scenarios. Cable theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite's mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon...

  19. Neurite, a finite difference large scale parallel program for the simulation of electrical signal propagation in neurites under mechanical loading.

    Science.gov (United States)

    García-Grajales, Julián A; Rucabado, Gabriel; García-Dopico, Antonio; Peña, José-María; Jérusalem, Antoine

    2015-01-01

    With the growing body of research on traumatic brain injury and spinal cord injury, computational neuroscience has recently focused its modeling efforts on neuronal functional deficits following mechanical loading. However, in most of these efforts, cell damage is generally characterized only by purely mechanistic criteria: functions of quantities such as stress, strain, or their corresponding rates. The modeling of functional deficits in neurites as a consequence of macroscopic mechanical insults has rarely been explored. In particular, a quantitative, mechanically based model of electrophysiological impairment in neuronal cells, Neurite, has only very recently been proposed. In this paper, we present the implementation details of this model: a finite difference parallel program for simulating electrical signal propagation along neurites under mechanical loading. Following the application of a macroscopic strain at a given strain rate produced by a mechanical insult, Neurite is able to simulate the resulting neuronal electrical signal propagation, and thus the corresponding functional deficits. The simulation of the coupled mechanical and electrophysiological behaviors requires computationally expensive calculations that increase in complexity as the network of simulated cells grows. The solvers implemented in Neurite (explicit and implicit) were therefore parallelized using graphics processing units in order to reduce the simulation costs of large-scale scenarios. Cable theory and Hodgkin-Huxley models were implemented to account for the electrophysiological passive and active regions of a neurite, respectively, whereas a coupled mechanical model accounting for the neurite's mechanical behavior within its surrounding medium was adopted as a link between electrophysiology and mechanics. This paper provides the details of the parallel implementation of Neurite, along with three different application examples: a long myelinated axon, a segmented...

  20. Programming margin enlargement by material engineering for multilevel storage in phase-change memory

    Science.gov (United States)

    Yin, You; Noguchi, Tomoyuki; Ohno, Hiroki; Hosaka, Sumio

    2009-09-01

    In this work, we investigate the effect of material engineering on the programming margin in double-layered phase-change memory, the most important parameter for the stability of multilevel storage. Compared with the TiN/SbTeN cell, the TiSiN/GeSbTe double-layered cell exhibits a ratio of the highest to the lowest resistance level of up to two to three orders of magnitude, indicating a much larger programming margin and thus higher stability and/or more available levels. Our calculation results show that the resistivities of the top heating layer and the phase-change layer have a significant effect on the programming margin.

  1. Study on Patterns for Parallel Programming Based on CMP System

    Institute of Scientific and Technical Information of China (English)

    胥秀峰; 鲍广宇; 黄海燕; 吴亚宁

    2014-01-01

    Studying patterns for parallel programming based on CMP (Chip Multiple Processors) systems aims to establish a complete methodology for developing parallel programs on CMP systems. This paper first briefly introduces multi-core parallel computing; then, by summarizing the problems of parallel computing on CMP systems, it proposes a conceptual model of parallel programming patterns for CMP systems comprising four core elements: parallel architecture, parallel algorithm design model, development environment, and parallel program implementation model. The main connotations of these concepts and their sub-concepts are then described. Finally, the parallel programming patterns are illustrated with an example, initially verifying the reasonableness of the patterns.

  2. Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

    Science.gov (United States)

    Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

    1999-01-01

    The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves superb load balance at the cost of minimal extra overhead.

  3. Parallel programming of gradient-based iterative image reconstruction schemes for optical tomography.

    Science.gov (United States)

    Hielscher, Andreas H; Bartel, Sebastian

    2004-02-01

    Optical tomography (OT) is a fast-developing novel imaging modality that uses near-infrared (NIR) light to obtain cross-sectional views of optical properties inside the human body. A major challenge remains the time-consuming, computationally intensive image reconstruction problem that converts NIR transmission measurements into cross-sectional images. To increase the speed of iterative image reconstruction schemes that are commonly applied for OT, we have developed and implemented several parallel algorithms on a cluster of workstations. Static process distribution as well as dynamic load balancing schemes suitable for heterogeneous clusters and varying machine performances are introduced and tested. The resulting algorithms are shown to accelerate the reconstruction process to various degrees, substantially reducing the computation times for clinically relevant problems.

  4. Memory T and memory B cells share a transcriptional program of self-renewal with long-term hematopoietic stem cells

    OpenAIRE

    Luckey, Chance John; Bhattacharya, Deepta; Goldrath, Ananda W; Weissman, Irving L; Benoist, Christophe; Mathis, Diane

    2006-01-01

    The only cells of the hematopoietic system that undergo self-renewal for the lifetime of the organism are long-term hematopoietic stem cells and memory T and B cells. To determine whether there is a shared transcriptional program among these self-renewing populations, we first compared the gene-expression profiles of naïve, effector and memory CD8+ T cells with those of long-term hematopoietic stem cells, short-term hematopoietic stem cells, and lineage-committed progenitors. Transcripts augm...

  5. Parallel image computation in clusters with task-distributor.

    Science.gov (United States)

    Baun, Christian

    2016-01-01

    Distributed systems, especially clusters, can be used to execute ray tracing tasks in parallel to speed up image computation. Because ray tracing is a computationally expensive and memory-consuming task, it can also be used to benchmark clusters. This paper introduces task-distributor, a free software solution for the parallel execution of ray tracing tasks in distributed systems. The ray tracing solution used for this work is the Persistence Of Vision Raytracer (POV-Ray). In contrast to various other projects, task-distributor does not require any modification of the POV-Ray source code or the installation of an additional message passing library like the Message Passing Interface or Parallel Virtual Machine to allow parallel image computation. By analyzing the runtime of the sequential and parallel parts of task-distributor, it becomes clear how the problem size and available hardware resources influence the scaling of the parallel application.

  6. Two applications of parallel processing in power system computation

    Energy Technology Data Exchange (ETDEWEB)

    Lemaitre, C.; Thomas, B. [Electricite de France, 92 - Clamart (France). Research and Development Div.]

    1996-12-31

    Performance improvements achieved in two power system software modules through the use of parallel processing techniques are discussed. The first software module, EVARISTE, outputs a voltage stability indicator for various power system situations. The second module, MEXICO, assesses power system reliability and operating costs by simulating a large number of contingencies for generation and transmission equipment. Both software modules are well suited to coarse-grain parallel processing. The first module was parallelized on a distributed-memory machine and the second on a shared-memory machine. A description of the parallelization process used in these two cases is presented, and details of the performance levels achieved are discussed, including aspects of programming, parameter selection, and machine characteristics. (author) 7 refs.

  7. Deterministic Replay for Parallel Programs in Multi-Core Processors

    Institute of Scientific and Technical Information of China (English)

    高岚; 王锐; 钱德沛

    2013-01-01

    Deterministic replay of parallel programs on multi-core processors is an effective means of debugging parallel programs and is of great significance to parallel programming. However, because of the difficulty of tackling unsynchronized accesses to shared memory on multi-core architectures, research on deterministic replay still faces many challenges; debugging parallel programs remains very difficult, which seriously hinders the adoption and development of parallel programs on multi-core architectures, and industrial-strength deterministic replay has not yet emerged. This paper analyzes the non-deterministic events that make deterministic replay hard to achieve on multi-core processors and summarizes the metrics for evaluating deterministic replay schemes. After surveying recent research on deterministic replay for parallel programs, it analyzes and compares current software-only and hardware-assisted replay schemes against these metrics, and on this basis gives an outlook on future research trends and application prospects of deterministic replay for parallel programs on multi-core architectures.
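
    The record-and-replay idea surveyed here can be shown in miniature: log the order in which threads win a shared lock during a recording run, then admit threads in exactly that order during replay. The POSIX-threads sketch below is an invented toy, not one of the surveyed schemes.

```c
/* Miniature software record/replay of a shared-memory interleaving:
 * the recording run logs which thread wins each lock acquisition;
 * the replay run admits threads strictly in the logged order. */
#include <pthread.h>
#include <stdio.h>

#define OPS 8                        /* lock acquisitions per thread */
#define LOGLEN (2 * OPS)

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int logbuf[LOGLEN], logged = 0, cursor = 0, replay = 0;

static void *worker(void *arg)
{
    int id = (int)(long)arg;
    for (int k = 0; k < OPS; k++) {
        pthread_mutex_lock(&mu);
        if (replay)                  /* wait until the log says it is our turn */
            while (logbuf[cursor] != id)
                pthread_cond_wait(&cv, &mu);
        else
            logbuf[logged++] = id;   /* record the winner's identity */
        cursor += replay;
        /* ... critical-section work on shared state would go here ... */
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&mu);
    }
    return NULL;
}

static void run(int mode)
{
    pthread_t t0, t1;
    replay = mode;
    cursor = 0;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
}

int main(void)
{
    run(0);                          /* record a nondeterministic interleaving */
    run(1);                          /* replay it deterministically */
    printf("replayed %d logged acquisitions\n", logged);
    return 0;
}
```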

  8. Feasibility study of current pulse induced 2-bit/4-state multilevel programming in phase-change memory

    Science.gov (United States)

    Liu, Yan; Fan, Xi; Chen, Houpeng; Wang, Yueqing; Liu, Bo; Song, Zhitang; Feng, Songlin

    2017-08-01

    Multilevel data storage for phase-change memory (PCM) has attracted increasing attention in the memory market as a way to implement high-capacity memory systems and reduce cost per bit. In this work, we present a universal programming method using SET staircase current pulses in PCM cells, which exploits an optimized programming scheme to achieve 2-bit/4-state resistance levels with equal logarithmic intervals. The SET staircase waveform can be optimized by TCAD real-time simulation to realize multilevel data storage efficiently in an arbitrary phase-change material. Experimental results from a 1 k-bit PCM test chip have validated the proposed multilevel programming scheme, which improves information storage density, resistance-level robustness, and energy efficiency while avoiding process complexity.
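
    As an illustration only, the following routine tabulates a staircase SET current waveform of the general kind described; the step currents and dwell times are invented placeholders, not values from the paper.

```c
/* Tabulate a staircase SET current pulse: each step holds a current
 * level for a fixed dwell time. Levels and timings are illustrative. */
#include <stdio.h>

#define STEPS 4
#define SAMPLES_PER_STEP 5

int main(void)
{
    /* Hypothetical step currents (uA), stepping down to anneal the cell
     * into progressively lower-resistance (more crystalline) states. */
    const double level_uA[STEPS] = {400.0, 300.0, 200.0, 100.0};
    const double dwell_ns = 50.0;    /* dwell per step, illustrative */

    for (int s = 0; s < STEPS; s++)
        for (int k = 0; k < SAMPLES_PER_STEP; k++) {
            double t = (s * SAMPLES_PER_STEP + k) * (dwell_ns / SAMPLES_PER_STEP);
            printf("%8.1f ns  %6.1f uA\n", t, level_uA[s]);
        }
    return 0;
}
```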

  9. Data-parallel DNS of turbulent flow

    NARCIS (Netherlands)

    Verstappen, R.W.C.P.; Veldman, A.E.P.; Emerson, DR; Ecer, A; Periaux, J; Satofuka, N

    1998-01-01

    This contribution deals with direct numerical simulation (DNS) of incompressible turbulent flows on parallel computers. We make use of the data-parallel model on shared memory systems as well as on a distributed memory machine. The combination of fast parallel computers and efficient numerical algorithms...

  10. Memory-based frame synchronizer. [for digital communication systems

    Science.gov (United States)

    Stattel, R. J.; Niswander, J. K. (Inventor)

    1981-01-01

    A frame synchronizer for use in digital communications systems wherein data formats can be easily and dynamically changed is described. The use of memory array elements provides increased flexibility in format selection and sync word selection, in addition to real-time reconfiguration ability. The frame synchronizer comprises a serial-to-parallel converter which converts a serial input data stream into a constantly changing parallel data output. This parallel data output is supplied to programmable sync word recognizers, each consisting of a multiplexer and a random access memory (RAM). The multiplexer is connected to both the parallel data output and an address bus which may be connected to a microprocessor or computer for purposes of programming the sync word recognizer. The RAM is used as an associative memory or decoder and is programmed to identify a specific sync word. Additional programmable RAMs are used as counter decoders to define word bit length, frame word length, and paragraph frame length.
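
    A behavioral software model can make the RAM-based recognizer concrete. The sketch below assumes 8-bit sync words for brevity: the serial-to-parallel converter becomes a shift register, and the "RAM" is a 256-entry lookup table programmed at run time to flag one sync word; the bit stream and the 0xB5 pattern are invented.

```c
/* Behavioral model of a memory-based sync-word recognizer: a shift
 * register feeds an address into a RAM programmed to flag the sync word. */
#include <stdint.h>
#include <stdio.h>

static uint8_t ram[256];             /* "associative memory": 1 = sync word */

static void program_sync_word(uint8_t word) { ram[word] = 1; }

int main(void)
{
    program_sync_word(0xB5);         /* reprogrammable at run time, as in the design */

    const uint8_t stream[] = {0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    uint8_t shift = 0;

    for (size_t i = 0; i < sizeof stream; i++) {
        shift = (uint8_t)((shift << 1) | (stream[i] & 1));  /* serial -> parallel */
        if (ram[shift])
            printf("sync found ending at bit %zu\n", i);
    }
    return 0;
}
```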

  11. The Establishment of Parallel Systems on Microcomputer and the Development of Parallel Programs

    Institute of Scientific and Technical Information of China (English)

    王顺绪; 李志英

    2001-01-01

    The significance of establishing a parallel environment on microcomputers and carrying out parallel simulation is illustrated, and methods for installing PVM on microcomputers are given. The .cshrc file that makes PVM run correctly and example PVM application programs in the master/slave programming model are also presented.
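
    For readers unfamiliar with PVM, a minimal master/slave skeleton looks roughly as follows; the spawned executable name "worker" and the squaring workload are placeholders, and error handling is omitted.

```c
/* Master/slave skeleton in PVM, the model the record describes.
 * "worker" must name this same executable in the PVM path;
 * the squaring task is a placeholder workload. */
#include <stdio.h>
#include <pvm3.h>

#define NSLAVES 4

int main(void)
{
    int parent = pvm_parent();

    if (parent == PvmNoParent) {            /* master side */
        int tids[NSLAVES];
        pvm_spawn("worker", NULL, PvmTaskDefault, "", NSLAVES, tids);
        for (int i = 0; i < NSLAVES; i++) { /* send each slave one number */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&i, 1, 1);
            pvm_send(tids[i], 1);
        }
        for (int i = 0; i < NSLAVES; i++) { /* collect results */
            int r;
            pvm_recv(-1, 2);
            pvm_upkint(&r, 1, 1);
            printf("result: %d\n", r);
        }
    } else {                                /* slave side */
        int n;
        pvm_recv(parent, 1);
        pvm_upkint(&n, 1, 1);
        n = n * n;                          /* placeholder computation */
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(parent, 2);
    }
    pvm_exit();
    return 0;
}
```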

  12. Design and Implementation of FEM Parallel Program

    Institute of Scientific and Technical Information of China (English)

    余天堂; 姜弘道

    2000-01-01

    Parallel computation of the finite element method (FEM) over a distributed network is an important direction in FEM parallel computation. A program design method, and its implementation, for FEM parallel analysis over a network based on PVM is presented. A substructure parallel analysis method with multi-front parallel processing is adopted for the FEM parallel computation, and the interface equations are solved with the Preconditioned Conjugate Gradient (PCG) method. The implementation of this design method is easy, and an example shows that it can obtain a higher speedup ratio.

  13. PARALLEL LDA ALGORITHM BASED ON SHARED MEMORY

    Institute of Scientific and Technical Information of China (English)

    杨希; 刘晓升; 杨璐; 严建峰

    2016-01-01

    In existing shared-memory parallel implementations of the latent Dirichlet allocation (LDA) topic model, threads usually wait for one another because of unbalanced data distribution, which leads to low efficiency. This paper studies the thread-waiting problem and proposes a dynamic thread-scheduling scheme. The scheme partitions the data into blocks according to the number of threads and, on this basis, dynamically assigns tasks to idle threads in a timely manner, thereby reducing the waiting time between threads. Experiments show that the new scheduling scheme effectively solves the thread-waiting problem: it achieves a 25% improvement in speedup while preserving convergence accuracy, and it also significantly improves the scale-up ratio. For the parallel LDA algorithm on a single node of a large-scale distributed cluster, this scheduling makes more effective use of computing resources.
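
    The thread-waiting problem and its dynamic remedy can be seen in miniature with OpenMP's dynamic loop scheduling, which hands the next block to whichever thread falls idle. This is only an analogy to the paper's scheduler, with a deliberately unbalanced synthetic workload.

```c
/* Dynamic block scheduling in miniature: schedule(dynamic) hands the
 * next block to whichever thread goes idle, so unequal block costs do
 * not leave threads waiting at the end of a static partition. */
#include <omp.h>
#include <stdio.h>

#define NBLOCKS 64

static double process_block(int b)
{
    double s = 0.0;
    long n = 100000L * (b % 7 + 1);   /* deliberately unbalanced work */
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(void)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int b = 0; b < NBLOCKS; b++)
        total += process_block(b);
    printf("checksum %.6f using up to %d threads\n", total, omp_get_max_threads());
    return 0;
}
```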

  14. Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

    Science.gov (United States)

    2013-01-02

    ECL: Embedded Common Lisp; EM: Expectation-Maximization; FFI: Foreign Function Interface; FFT: Fast Fourier Transform; FFTW: Fastest Fourier Transform in the West. ... Domain-specific embedded languages are used in many programming languages designed to be extensible using metaprogramming, including Haskell and variants of Lisp. These DSELs generally transform host language code into...

  15. Petri Nets Based Modelling of Control Flow for Memory-Aid Interactive Programs in Telemedicine

    CERN Document Server

    Khoromskaia, V K

    2004-01-01

    Petri Net (PN) based modelling of the control flow of interactive memory-assistance programs, designed for personal pocket computers and having special requirements for robustness, is considered. The proposed concept allows one to elaborate programs which can give users a variety of possibilities for day-time planning in the presence of environmental and time restrictions. First, a PN model for a known simple algorithm is constructed and analyzed using the corresponding state equations and incidence matrix. Then a PN graph for a complicated algorithm with overlapping actions and choice possibilities is designed, supplemented by an example of its analysis. The dynamic behaviour of this graph is tested by tracing all possible paths of the flow of control using a PN simulator. It is shown that PN-based modelling provides reliably predictable performance of interactive algorithms with branched structures and concurrency requirements.
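
    For readers unfamiliar with the state-equation view used above, a marking evolves as m' = m + C u, where C = Post - Pre is the incidence matrix and u selects the fired transition. The tiny three-place, two-transition net below is an invented example, not the paper's model.

```c
/* Petri-net state equation in miniature: marking' = marking + C*u,
 * where C = Post - Pre is the incidence matrix. */
#include <stdio.h>

#define P 3   /* places */
#define T 2   /* transitions */

static const int Pre [P][T] = {{1,0},{0,1},{0,0}};  /* tokens consumed */
static const int Post[P][T] = {{0,0},{1,0},{0,1}};  /* tokens produced */

static int enabled(const int *m, int t)
{
    for (int p = 0; p < P; p++)
        if (m[p] < Pre[p][t]) return 0;
    return 1;
}

static void fire(int *m, int t)       /* m[p] += C[p][t] = Post - Pre */
{
    for (int p = 0; p < P; p++)
        m[p] += Post[p][t] - Pre[p][t];
}

int main(void)
{
    int m[P] = {1, 0, 0};             /* initial marking */
    int seq[] = {0, 1};               /* one path of the control flow */
    for (unsigned i = 0; i < sizeof seq / sizeof *seq; i++)
        if (enabled(m, seq[i]))
            fire(m, seq[i]);
    printf("final marking: (%d, %d, %d)\n", m[0], m[1], m[2]);
    return 0;
}
```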

  16. Practical parallel computing

    CERN Document Server

    Morse, H Stephen

    1994-01-01

    Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment. Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages...

  17. A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)

    Directory of Open Access Journals (Sweden)

    B. van Werkhoven

    2013-09-01

    Full Text Available The Parallel Ocean Program (POP) is used in many strongly eddying ocean circulation simulations. Ideally one would like to do thousand-year-long simulations, but the current performance of POP prohibits this type of simulation. In this work, using a new distributed computing approach, two innovations to improve the performance of POP are presented. The first is a new block partitioning scheme for the optimization of the load balancing of POP such that it can be run efficiently in a multi-platform setting. The second is an implementation of part of the POP model code on Graphics Processing Units. We show that the combination of both innovations leads to a substantial performance increase, also when running POP simultaneously over multiple computational platforms.

  18. Comparative Study of Dynamic Programming and Pontryagin’s Minimum Principle on Energy Management for a Parallel Hybrid Electric Vehicle

    Directory of Open Access Journals (Sweden)

    Huei Peng

    2013-04-01

    Full Text Available This paper compares two optimal energy management methods for parallel hybrid electric vehicles using an Automatic Manual Transmission (AMT). A control-oriented model of the powertrain and vehicle dynamics is built first. The energy management is formulated as a typical optimal control problem to trade off the fuel consumption and gear shifting frequency under admissible constraints. Dynamic Programming (DP) and Pontryagin's Minimum Principle (PMP) are applied to obtain the optimal solutions. Tuned with the appropriate co-states, the PMP solution is found to be very close to that from DP. The solution for the gear shifting in PMP has an algebraic expression associated with the vehicular velocity and can be implemented more efficiently in the control algorithm. The computation time of PMP is significantly less than that of DP.

  19. A parallel PCG solver for MODFLOW.

    Science.gov (United States)

    Dong, Yanhui; Li, Guomin

    2009-01-01

    In order to simulate large-scale ground water flow problems more efficiently with MODFLOW, the OpenMP programming paradigm was used in this study to parallelize the preconditioned conjugate-gradient (PCG) solver. Incremental parallelization, the significant advantage supported by OpenMP on a shared-memory computer, let the solver transition to a parallel program smoothly, one block of code at a time. The parallel PCG solver, suitable for both MODFLOW-2000 and MODFLOW-2005, was verified using an 8-processor computer. Both the impact of compilers and different model domain sizes were considered in the numerical experiments. Based on the timing results, execution times using the parallel PCG solver are typically about 1.40 to 5.31 times faster than those using the serial one. In addition, the simulation results are exactly the same as those of the original PCG solver, because the majority of the serial code was not changed. It is worth noting that this parallelizing approach reduces software maintenance costs because only a single-source PCG solver code needs to be maintained in the MODFLOW source tree.
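
    The incremental-parallelization point is easy to see on the kernels that dominate a PCG iteration, dot products and vector updates: each loop gets its own OpenMP directive while the surrounding serial code is untouched. The vectors below are synthetic stand-ins, not MODFLOW data.

```c
/* One-block-at-a-time OpenMP parallelization of typical PCG kernels. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double x[N], p[N], q[N];

int main(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++) { x[i] = 1.0; p[i] = 2.0; q[i] = 0.5; }

    /* dot product: the reduction clause preserves the serial semantics */
    double pq = 0.0;
    #pragma omp parallel for reduction(+:pq)
    for (int i = 0; i < N; i++)
        pq += p[i] * q[i];

    /* axpy update: independent iterations, a plain parallel for */
    double alpha = 1.0 / pq;
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] += alpha * p[i];

    printf("x[0] = %f (threads available: %d)\n", x[0], omp_get_max_threads());
    return 0;
}
```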

  20. Research on Parallel Program Design Method Based on Distributed Objects

    Institute of Scientific and Technical Information of China (English)

    龚向坚; 邹腊梅; 马淑萍

    2011-01-01

    This paper studies the parallel implementation and optimization of distributed objects, proposes a parallel program design method based on distributed objects, and puts forward a corresponding parallel programming model. Using this method, a virtual computer network experiment system is designed and implemented. Experimental results show that the system has good parallelism and a moderate response time, demonstrating that the distributed-object-based parallel program design method is effective in improving the parallelism of computer systems.

  1. Parallel models of associative memory

    CERN Document Server

    Hinton, Geoffrey E

    2014-01-01

    This update of the 1981 classic on neural networks includes new commentaries by the authors that show how the original ideas are related to subsequent developments. As researchers continue to uncover ways of applying the complex information processing abilities of neural networks, they give these models an exciting future which may well involve revolutionary developments in understanding the brain and the mind -- developments that may allow researchers to build adaptive intelligent machines. The original chapters show where the ideas came from and the new commentaries show where they are going

  2. Micro-mechanical Simulations of Soils using Massively Parallel Supercomputers

    Directory of Open Access Journals (Sweden)

    David W. Washington

    2004-06-01

    Full Text Available In this research a computer program, Trubal version 1.51, based on the Discrete Element Method was converted to run on a Connection Machine (CM-5), a massively parallel supercomputer with 512 nodes, to expedite the computational times of simulating geotechnical boundary value problems. The dynamic memory algorithm in the Trubal program did not perform efficiently on the CM-2 machine with the Single Instruction Multiple Data (SIMD) architecture. This was due to the communication overhead involving global array reductions, global array broadcasts and random data movement. Therefore, the dynamic memory algorithm in the Trubal program was converted to a static memory arrangement, and the Trubal program was successfully converted to run on CM-5 machines. The converted program was called "TRUBAL for Parallel Machines (TPM)." Simulating two physical triaxial experiments and comparing simulation results with Trubal simulations validated the TPM program. With a 512-node CM-5 machine, TPM produced a nine-fold speedup, demonstrating the inherent parallelism within algorithms based on the Discrete Element Method.

  3. PKA Increases in the Olfactory Bulb Act as Unconditioned Stimuli and Provide Evidence for Parallel Memory Systems: Pairing Odor with Increased PKA Creates Intermediate- and Long-Term, but Not Short-Term, Memories

    Science.gov (United States)

    Grimes, Matthew T.; Harley, Carolyn W.; Darby-King, Andrea; McLean, John H.

    2012-01-01

    Neonatal odor-preference memory in rat pups is a well-defined associative mammalian memory model dependent on cAMP. Previous work from this laboratory demonstrates three phases of neonatal odor-preference memory: short-term (translation-independent), intermediate-term (translation-dependent), and long-term (transcription- and…

  5. Introducing PROFESS 2.0: A parallelized, fully linear scaling program for orbital-free density functional theory calculations

    Science.gov (United States)

    Hung, Linda; Huang, Chen; Shin, Ilgyou; Ho, Gregory S.; Lignères, Vincent L.; Carter, Emily A.

    2010-12-01

    Orbital-free density functional theory (OFDFT) is a first principles quantum mechanics method to find the ground-state energy of a system by variationally minimizing with respect to the electron density. No orbitals are used in the evaluation of the kinetic energy (unlike Kohn-Sham DFT), and the method scales nearly linearly with the size of the system. The PRinceton Orbital-Free Electronic Structure Software (PROFESS) uses OFDFT to model materials from the atomic scale to the mesoscale. This new version of PROFESS allows the study of larger systems with two significant changes: PROFESS is now parallelized, and the ion-electron and ion-ion terms scale quasilinearly, instead of quadratically as in PROFESS v1 (L. Hung and E.A. Carter, Chem. Phys. Lett. 475 (2009) 163). At the start of a run, PROFESS reads the various input files that describe the geometry of the system (ion positions and cell dimensions), the type of elements (defined by electron-ion pseudopotentials), the actions you want it to perform (minimize with respect to electron density and/or ion positions and/or cell lattice vectors), and the various options for the computation (such as which functionals you want it to use). Based on these inputs, PROFESS sets up a computation and performs the appropriate optimizations. Energies, forces, stresses, material geometries, and electron density configurations are some of the values that can be output throughout the optimization. New version program summary: Program title: PROFESS; Catalogue identifier: AEBN_v2_0; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEBN_v2_0.html; Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland; Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html; No. of lines in distributed program, including test data, etc.: 68,721; No. of bytes in distributed program, including test data, etc.: 1,708,547; Distribution format: tar.gz; Programming language: Fortran 90; Computer: ...

  6. Asymmetric Programming: A Highly Reliable Metadata Allocation Strategy for MLC NAND Flash Memory-Based Sensor Systems

    Science.gov (United States)

    Huang, Min; Liu, Zhaoqing; Qiao, Liyan

    2014-01-01

    While the NAND flash memory is widely used as the storage medium in modern sensor systems, the aggressive shrinking of process geometry and an increase in the number of bits stored in each memory cell will inevitably degrade the reliability of NAND flash memory. In particular, it's critical to enhance metadata reliability, which occupies only a small portion of the storage space, but maintains the critical information of the file system and the address translations of the storage system. Metadata damage will cause the system to crash or a large amount of data to be lost. This paper presents Asymmetric Programming, a highly reliable metadata allocation strategy for MLC NAND flash memory storage systems. Our technique exploits for the first time the property of the multi-page architecture of MLC NAND flash memory to improve the reliability of metadata. The basic idea is to keep metadata in most significant bit (MSB) pages which are more reliable than least significant bit (LSB) pages. Thus, we can achieve relatively low bit error rates for metadata. Based on this idea, we propose two strategies to optimize address mapping and garbage collection. We have implemented Asymmetric Programming on a real hardware platform. The experimental results show that Asymmetric Programming can achieve a reduction in the number of page errors of up to 99.05% with the baseline error correction scheme. PMID:25310473
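
    The core placement idea admits a small sketch. The page-pairing rule below (odd page indices taken as MSB pages) is a hypothetical layout chosen for illustration; real MLC pairings are device-specific, and the paper's address mapping and garbage-collection strategies are not reproduced here.

```c
/* Sketch of MSB-first metadata placement. The pairing rule below
 * (odd page index = MSB within a block) is a hypothetical layout. */
#include <stdio.h>

#define PAGES_PER_BLOCK 128

static int is_msb_page(int page) { return page % 2 == 1; }  /* assumed layout */

/* Return the next page of the requested reliability class, or -1. */
static int alloc_page(int want_msb, int *next)
{
    while (*next < PAGES_PER_BLOCK) {
        int p = (*next)++;
        if (is_msb_page(p) == want_msb)
            return p;
    }
    return -1;
}

int main(void)
{
    int cursor = 0;
    int meta = alloc_page(1, &cursor);   /* metadata -> more reliable MSB page */
    int data = alloc_page(0, &cursor);   /* bulk data -> LSB page */
    printf("metadata page %d, data page %d\n", meta, data);
    return 0;
}
```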

  7. Integrated Network Decompositions and Dynamic Programming for Graph Optimization (INDDGO)

    Energy Technology Data Exchange (ETDEWEB)

    2012-05-31

    The INDDGO software package offers a set of tools for finding exact solutions to graph optimization problems via tree decompositions and dynamic programming algorithms. Currently the framework offers serial and parallel (distributed memory) algorithms for finding tree decompositions and solving the maximum weighted independent set problem. The parallel dynamic programming algorithm is implemented on top of the MADNESS task-based runtime.

  8. Parallel methods in problems of mathematical physics

    OpenAIRE

    Boris Rybakin

    1996-01-01

    The article deals with various methods of parallelization of algorithms of problems of mathematical physics. Parallel methods of solution of these problems on the basis of multiprocessor transputer based systems with distributed memory are considered.

  9. Multilingual interfaces for parallel coupling in multiphysics and multiscale systems.

    Energy Technology Data Exchange (ETDEWEB)

    Ong, E. T.; Larson, J. W.; Norris, B.; Jacob, R. L.; Tobis, M.; Steder, M.; Mathematics and Computer Science; Univ. of Wisconsin; Australian National Univ.; Univ. of Chicago

    2007-01-01

    Multiphysics and multiscale simulation systems are emerging as a new grand challenge in computational science, largely because of increased computing power provided by the distributed-memory parallel programming model on commodity clusters. These systems often present a parallel coupling problem in their intercomponent data exchanges. Another potential problem in these coupled systems is language interoperability between their various constituent codes. In anticipation of combined parallel coupling/language interoperability challenges, we have created a set of interlanguage bindings for a successful parallel coupling library, the Model Coupling Toolkit. We describe the method used for automatically generating the bindings using the Babel language interoperability tool, and illustrate with short examples how MCT can be used from the C++ and Python languages. We report preliminary performance results for the MCT interpolation benchmark. We conclude with a discussion of the significance of this work to the rapid prototyping of large parallel coupled systems.

  10. Cyber-EDA: Estimation of Distribution Algorithms with Adaptive Memory Programming

    Directory of Open Access Journals (Sweden)

    Peng-Yeng Yin

    2013-01-01

    Full Text Available The estimation of distribution algorithm (EDA) aims to explicitly model the probability distribution of the quality solutions to the underlying problem. By iteratively filtering quality solutions from competing ones, the probability model eventually approximates the distribution of globally optimal solutions. In contrast to classic evolutionary algorithms (EAs), the EDA framework is flexible and is able to handle inter-variable dependence, which usually imposes difficulties on classic EAs. The success of EDA relies on effective and efficient building of the probability model. This paper facilitates EDA from the adaptive memory programming (AMP) domain, which has developed several improved forms of EAs using the Cyber-EA framework. The experimental results on benchmark TSP instances support our anticipation that AMP strategies can enhance the performance of classic EDA by deriving a better approximation of the true distribution of the target solutions.

  11. MPI-based Parallel Programming and Implementation of Knapsack Problem

    Institute of Scientific and Technical Information of China (English)

    张居晓

    2011-01-01

    MPI (Message Passing Interface) is one of the standards for message-passing parallel programming. This paper outlines the concept and composition of MPI, focuses on the message-passing interface supporting parallel programming and on parallel program design methods in the MPI environment, and gives an MPI parallel programming example to illustrate the MPI program design flow and its relation to ordinary serial program design.
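
    A minimal MPI knapsack example in the spirit of the record might split the subset space of a small 0/1 instance cyclically across ranks and combine the partial optima with a MAX-reduction; the item weights and values below are invented.

```c
/* MPI example: ranks split the 2^N subset space of a small 0/1 knapsack
 * and a MAX-reduction combines the per-rank best values. */
#include <mpi.h>
#include <stdio.h>

#define N 20
#define CAPACITY 50

static const int w[N] = {3, 8, 5, 2, 9, 4, 7, 6, 1, 10, 3, 5, 8, 2, 6, 4, 9, 7, 5, 3};
static const int v[N] = {4, 10, 7, 3, 11, 5, 9, 8, 2, 12, 4, 6, 10, 3, 8, 5, 11, 9, 7, 4};

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long nsub = 1L << N;
    int best = 0;
    for (long s = rank; s < nsub; s += size) {  /* cyclic split of subsets */
        int wt = 0, val = 0;
        for (int i = 0; i < N; i++)
            if (s & (1L << i)) { wt += w[i]; val += v[i]; }
        if (wt <= CAPACITY && val > best)
            best = val;
    }

    int global = 0;
    MPI_Reduce(&best, &global, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("best value: %d\n", global);

    MPI_Finalize();
    return 0;
}
```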

  12. Thread-Based Automatic Parallel Conversion Technique for Java Programs

    Institute of Scientific and Technical Information of China (English)

    刘英; 刘磊; 张乃孝

    2001-01-01

    The study of parallelism for Java programs is an important subject at present. This paper presents an automatic parallel conversion technique for Java programs: serial Java programs are automatically transformed into parallel programs by making full use of the multithreading mechanism provided by the Java language itself, together with a conflict-detection technique for operations. With the support of a multiprocessor operating system, the transformed parallel programs can run on shared-memory multiprocessor systems, thereby improving program efficiency.

  13. Memory protection

    Science.gov (United States)

    Denning, Peter J.

    1988-01-01

    Accidental overwriting of files or of memory regions belonging to other programs, browsing of personal files by superusers, Trojan horses, and viruses are examples of breakdowns in workstations and personal computers that would be significantly reduced by memory protection. Memory protection is the capability of an operating system and supporting hardware to delimit segments of memory, to control whether segments can be read from or written into, and to confine accesses of a program to its segments alone. The absence of memory protection in many operating systems today is the result of a bias toward a narrow definition of performance as maximum instruction-execution rate. A broader definition, including the time to get the job done, makes clear that cost of recovery from memory interference errors reduces expected performance. The mechanisms of memory protection are well understood, powerful, efficient, and elegant. They add to performance in the broad sense without reducing instruction execution rate.

  14. Design strategies for irregularly adapting parallel applications

    Energy Technology Data Exchange (ETDEWEB)

    Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang; Singh, Jaswinder Pal

    2000-11-01

    Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance of dynamically adapting computations. In this work, we examine two major classes of adaptive applications, under five competing programming methodologies and four leading parallel architectures. Results indicate that it is possible to achieve message-passing performance using shared-memory programming techniques by carefully following the same high level strategies. Adaptive applications have computational work loads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations of such adaptive applications are therefore a challenging task. This work examines the implementation of two typical adaptive applications, Dynamic Remeshing and N-Body, across various programming paradigms and architectural platforms. We compare several critical factors of the parallel code development, including performance, programmability, scalability, algorithmic development, and portability.

  15. The geometric effect and programming current reduction in cylindrical-shaped phase change memory

    Science.gov (United States)

    Li, Yiming; Hwang, Chih-Hong; Li, Tien-Yeh; Cheng, Hui-Wen

    2009-07-01

    This study conducts a three-dimensional electro-thermal time-domain simulation for numerical analysis of cylindrical-shaped phase change memories (PCMs). The influence of chalcogenide material, germanium antimony telluride (GeSbTe or GST), structure on PCM operation is explored. GST with vertical structure exhibits promising characteristics. The bottom electrode contact (BEC) is advanced to improve the operation of PCMs, where a 25% reduction of the required programming current is achieved at a cost of 26% reduced resistance ratio. The position of the BEC is then shifted to further improve the performance of PCMs. The required programming current is reduced by a factor of 11, where the resistance ratio is only decreased by 6.9%. However, the PCMs with a larger shift of BEC are sensitive to process variation. To design PCMs with less than 10% programming current variation, PCMs with shifted BEC, where the shifted distance is equal to 1.5 times the BEC's radius, is worth considering. This study quantitatively estimates the structure effect on the phase transition of PCMs and physically provides an insight into the design and technology of PCMs.

  16. Parallel program performance optimization for data-intensive applications on multi-core clusters

    Institute of Scientific and Technical Information of China (English)

    黄华林; 钟诚

    2012-01-01

    Using a dynamic data-task-block scheduling policy and full-locking technology on heterogeneous multi-core cluster systems, this paper presents a hybrid multiprocess- and multithread-level parallel programming mechanism for data-intensive applications, which efficiently uses a node's main memory and the available shared L2 cache and dynamically schedules data blocks in the shared L2 cache; it also presents strategies and techniques for optimizing the performance of parallel programs for data-intensive applications. Experiments on a heterogeneous cluster of multi-core computers, solving multi-keyword search over random sequences in parallel, show that the proposed parallel programming mechanism and performance optimization methods are usable and efficient.

  17. Determining the Number of Factors to Retain in EFA: An easy-to-use computer program for carrying out Parallel Analysis

    Directory of Open Access Journals (Sweden)

    Rubin Daniel Ledesma

    2007-02-01

    Full Text Available Parallel Analysis is a Monte Carlo simulation technique that aids researchers in determining the number of factors to retain in Principal Component and Exploratory Factor Analysis. This method provides a superior alternative to other techniques that are commonly used for the same purpose, such as the Scree test or Kaiser's eigenvalue-greater-than-one rule. Nevertheless, Parallel Analysis is not well known among researchers, in part because it is not included as an analysis option in the most popular statistical packages. This paper describes and illustrates how to apply Parallel Analysis with an easy-to-use computer program called ViSta-PARAN. ViSta-PARAN is a user-friendly application that can compute and interpret Parallel Analysis. Its user interface is fully graphic and includes a dialog box to specify parameters, and specialized graphics to visualize the analysis output.

  18. A Parallel and Concurrent Implementation of Lin-Kernighan Heuristic (LKH-2 for Solving Traveling Salesman Problem for Multi-Core Processors using SPC3 Programming Model

    Directory of Open Access Journals (Sweden)

    Muhammad Ali Ismail

    2011-08-01

    Full Text Available With the arrival of multi-cores, every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of ongoing research on the design of a new parallel programming model for multi-core processors. In this paper we present a combined parallel and concurrent implementation of the Lin-Kernighan Heuristic (LKH-2) for solving the Travelling Salesman Problem (TSP) using a newly developed parallel programming model, SPC3 PM, for general-purpose multi-core processors. This implementation is found to be very simple, highly efficient, scalable, and less time consuming compared to the existing serial LKH-2 implementations in a multi-core processing environment. We have tested our parallel implementation of LKH-2 with medium and large TSP instances from TSPLIB, and for all these tests our proposed approach has shown much improved performance and scalability.

  19. RCCPAC: A parallel relativistic coupled-cluster program for closed-shell and one-valence atoms and ions in FORTRAN

    Science.gov (United States)

    Mani, B. K.; Chattopadhyay, S.; Angom, D.

    2017-04-01

    We report the development of a parallel FORTRAN code, RCCPAC, to solve the relativistic coupled-cluster equations for closed-shell and one-valence atoms and ions. The parallelization is implemented through the use of message passing interface, which is suitable for distributed memory computers. The coupled-cluster equations are defined in terms of the reduced matrix elements, and solved iteratively using Jacobi method. The ground and excited states of coupled-cluster wave functions obtained from the code could be used to compute different properties of closed-shell and one-valence atom or ion. As an example we compute the ground state correlation energy, attachment energies, E1 reduced matrix elements and hyperfine structure constants.

  20. NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Seyong [ORNL; Vetter, Jeffrey S [ORNL

    2016-01-01

    Computer architecture experts expect that non-volatile memory (NVM) hierarchies will play a more significant role in future systems including mobile, enterprise, and HPC architectures. With this expectation in mind, we present NVL-C: a novel programming system that facilitates the efficient and correct programming of NVM main memory systems. The NVL-C programming abstraction extends C with a small set of intuitive language features that target NVM main memory, and can be combined directly with traditional C memory model features for DRAM. We have designed these new features to enable compiler analyses and run-time checks that can improve performance and guard against a number of subtle programming errors, which, when left uncorrected, can corrupt NVM-stored data. Moreover, to enable recovery of data across application or system failures, these NVL-C features include a flexible directive for specifying NVM transactions. So that our implementation might be extended to other compiler front ends and languages, the majority of our compiler analyses are implemented in an extended version of LLVM's intermediate representation (LLVM IR). We evaluate NVL-C on a number of applications to show its flexibility, performance, and correctness.