Sample records for scalable task-oriented parallelism

  1. Equalizer: a scalable parallel rendering framework. (United States)

    Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato


    Continuing improvements in CPU and GPU performances as well as increasing multi-core processor and cluster-based parallelism demand for flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems ranging from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture, the basic API, discuss its advantages over previous approaches, present example configurations and usage scenarios as well as scalability results.

  2. Parallel scalability of Hartree-Fock calculations. (United States)

    Chow, Edmond; Liu, Xing; Smelyanskiy, Mikhail; Hammond, Jeff R


    Quantum chemistry is increasingly performed using large cluster computers consisting of multiple interconnected nodes. For a fixed molecular problem, the efficiency of a calculation usually decreases as more nodes are used, due to the cost of communication between the nodes. This paper empirically investigates the parallel scalability of Hartree-Fock calculations. The construction of the Fock matrix and the density matrix calculation are analyzed separately. For the former, we use a parallelization of Fock matrix construction based on a static partitioning of work followed by a work stealing phase. For the latter, we use density matrix purification from the linear scaling methods literature, but without using sparsity. When using large numbers of nodes for moderately sized problems, density matrix computations are network-bandwidth bound, making purification methods potentially faster than eigendecomposition methods.

  3. Current parallel I/O limitations to scalable data analysis.

    Energy Technology Data Exchange (ETDEWEB)

    Mascarenhas, Ajith Arthur; Pebay, Philippe Pierre


    This report describes the limitations to parallel scalability which we have encountered when applying our otherwise optimally scalable parallel statistical analysis tool kit to large data sets distributed across the parallel file system of the current premier DOE computational facility. This report describes our study to evaluate the effect of parallel I/O on the overall scalability of a parallel data analysis pipeline using our scalable parallel statistics tool kit [PTBM11]. In this goal, we tested it using the Jaguar-pf DOE/ORNL peta-scale platform on a large combustion simulation data under a variety of process counts and domain decompositions scenarios. In this report we have recalled the foundations of the parallel statistical analysis tool kit which we have designed and implemented, with the specific double intent of reproducing typical data analysis workflows, and achieving optimal design for scalable parallel implementations. We have briefly reviewed those earlier results and publications which allow us to conclude that we have achieved both goals. However, in this report we have further established that, when used in conjuction with a state-of-the-art parallel I/O system, as can be found on the premier DOE peta-scale platform, the scaling properties of the overall analysis pipeline comprising parallel data access routines degrade rapidly. This finding is problematic and must be addressed if peta-scale data analysis is to be made scalable, or even possible. In order to attempt to address these parallel I/O limitations, we will investigate the use the Adaptable IO System (ADIOS) [LZL+10] to improve I/O performance, while maintaining flexibility for a variety of IO options, such MPI IO, POSIX IO. This system is developed at ORNL and other collaborating institutions, and is being tested extensively on Jaguar-pf. Simulation code being developed on these systems will also use ADIOS to output the data thereby making it easier for other systems, such as ours, to

  4. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven


    parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chips. This means that parallel processing is required in application areas that traditionally have not used...

  5. Parallelism and Scalability in an Image Processing Application

    DEFF Research Database (Denmark)

    Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven


    parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chip. This means that parallel processing is required in application areas that traditionally have not used...

  6. A Scalable Prescriptive Parallel Debugging Model

    DEFF Research Database (Denmark)

    Jensen, Nicklas Bo; Quarfot Nielsen, Niklas; Lee, Gregory L.


    Debugging is a critical step in the development of any parallel program. However, the traditional interactive debugging model, where users manually step through code and inspect their application, does not scale well even for current supercomputers due its centralized nature. While lightweight...

  7. Scalable parallel prefix solvers for discrete ordinates transport

    International Nuclear Information System (INIS)

    Pautz, S.; Pandya, T.; Adams, M.


    The well-known 'sweep' algorithm for inverting the streaming-plus-collision term in first-order deterministic radiation transport calculations has some desirable numerical properties. However, it suffers from parallel scaling issues caused by a lack of concurrency. The maximum degree of concurrency, and thus the maximum parallelism, grows more slowly than the problem size for sweeps-based solvers. We investigate a new class of parallel algorithms that involves recasting the streaming-plus-collision problem in prefix form and solving via cyclic reduction. This method, although computationally more expensive at low levels of parallelism than the sweep algorithm, offers better theoretical scalability properties. Previous work has demonstrated this approach for one-dimensional calculations; we show how to extend it to multidimensional calculations. Notably, for multiple dimensions it appears that this approach is limited to long-characteristics discretizations; other discretizations cannot be cast in prefix form. We implement two variants of the algorithm within the radlib/SCEPTRE transport code library at Sandia National Laboratories and show results on two different massively parallel systems. Both the 'forward' and 'symmetric' solvers behave similarly, scaling well to larger degrees of parallelism then sweeps-based solvers. We do observe some issues at the highest levels of parallelism (relative to the system size) and discuss possible causes. We conclude that this approach shows good potential for future parallel systems, but the parallel scalability will depend heavily on the architecture of the communication networks of these systems. (authors)

  8. Parallelizing SLPA for Scalable Overlapping Community Detection

    Directory of Open Access Journals (Sweden)

    Konstantin Kuzmin


    Full Text Available Communities in networks are groups of nodes whose connections to the nodes in a community are stronger than with the nodes in the rest of the network. Quite often nodes participate in multiple communities; that is, communities can overlap. In this paper, we first analyze what other researchers have done to utilize high performance computing to perform efficient community detection in social, biological, and other networks. We note that detection of overlapping communities is more computationally intensive than disjoint community detection, and the former presents new challenges that algorithm designers have to face. Moreover, the efficiency of many existing algorithms grows superlinearly with the network size making them unsuitable to process large datasets. We use the Speaker-Listener Label Propagation Algorithm (SLPA as the basis for our parallel overlapping community detection implementation. SLPA provides near linear time overlapping community detection and is well suited for parallelization. We explore the benefits of a multithreaded programming paradigm and show that it yields a significant performance gain over sequential execution while preserving the high quality of community detection. The algorithm was tested on four real-world datasets with up to 5.5 million nodes and 170 million edges. In order to assess the quality of community detection, at least 4 different metrics were used for each of the datasets.

  9. A scalable parallel black oil simulator on distributed memory parallel computers (United States)

    Wang, Kun; Liu, Hui; Chen, Zhangxin


    This paper presents our work on developing a parallel black oil simulator for distributed memory computers based on our in-house parallel platform. The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations. The finite difference method is applied to discretize the black oil model. In addition, some advanced techniques are employed to strengthen the robustness and parallel scalability of the simulator, including an inexact Newton method, matrix decoupling methods, and algebraic multigrid methods. A new multi-stage preconditioner is proposed to accelerate the solution of linear systems from the Newton methods. Numerical experiments show that our simulator is scalable and efficient, and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.

  10. Task-oriented rehabilitation robotics. (United States)

    Schweighofer, Nicolas; Choi, Younggeun; Winstein, Carolee; Gordon, James


    Task-oriented training is emerging as the dominant and most effective approach to motor rehabilitation of upper extremity function after stroke. Here, the authors propose that the task-oriented training framework provides an evidence-based blueprint for the design of task-oriented robots for the rehabilitation of upper extremity function in the form of three design principles: skill acquisition of functional tasks, active participation training, and individualized adaptive training. The previous robotic systems that incorporate elements of task-oriented trainings are then reviewed. Finally, the authors critically analyze their own attempt to design and test the feasibility of a TOR robot, ADAPT (Adaptive and Automatic Presentation of Tasks), which incorporates the three design principles. Because of its task-oriented training-based design, ADAPT departs from most other current rehabilitation robotic systems: it presents realistic functional tasks in which the task goal is constantly adapted, so that the individual actively performs doable but challenging tasks without physical assistance. To maximize efficacy for a large clinical population, the authors propose that future task-oriented robots need to incorporate yet-to-be developed adaptive task presentation algorithms that emphasize acquisition of fine motor coordination skills while minimizing compensatory movements.

  11. A scalable parallel algorithm for multiple objective linear programs (United States)

    Wiecek, Malgorzata M.; Zhang, Hong


    This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.

  12. Final Report: Center for Programming Models for Scalable Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    Mellor-Crummey, John [William Marsh Rice University


    As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.

  13. Center for Programming Models for Scalable Parallel Computing

    Energy Technology Data Exchange (ETDEWEB)

    John Mellor-Crummey


    Rice University's achievements as part of the Center for Programming Models for Scalable Parallel Computing include: (1) design and implemention of cafc, the first multi-platform CAF compiler for distributed and shared-memory machines, (2) performance studies of the efficiency of programs written using the CAF and UPC programming models, (3) a novel technique to analyze explicitly-parallel SPMD programs that facilitates optimization, (4) design, implementation, and evaluation of new language features for CAF, including communication topologies, multi-version variables, and distributed multithreading to simplify development of high-performance codes in CAF, and (5) a synchronization strength reduction transformation for automatically replacing barrier-based synchronization with more efficient point-to-point synchronization. The prototype Co-array Fortran compiler cafc developed in this project is available as open source software from

  14. A scalable implementation of RI-SCF on parallel computers

    International Nuclear Information System (INIS)

    Fruechtl, H.A.; Kendall, R.A.; Harrison, R.J.


    In order to avoid the integral bottleneck of conventional SCF calculations, the Resolution of the Identity (RI) method is used to obtain an approximate solution to the Hartree-Fock equations. In this approximation only three-center integrals are needed to build the Fock matrix. It has been implemented as part of the NWChem package of portable and scalable ab initio programs for parallel computers. Utilizing the V-approximation, both the Coulomb and exchange contribution to the Fock matrix can be calculated from a transformed set of three-center integrals which have to be precalculated and stored. A distributed in-core method as well as a disk based implementation have been programmed. Details of the implementation as well as the parallel programming tools used are described. We also give results and timings from benchmark calculations

  15. Parallel peak pruning for scalable SMP contour tree computation

    Energy Technology Data Exchange (ETDEWEB)

    Carr, Hamish A. [Univ. of Leeds (United Kingdom); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Davis, CA (United States); Sewell, Christopher M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Ahrens, James P. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)


    As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.

  16. CX: A Scalable, Robust Network for Parallel Computing

    Directory of Open Access Journals (Sweden)

    Peter Cappello


    Full Text Available CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n- servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n-1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process. CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.

  17. Combined Scheduling and Mapping for Scalable Computing with Parallel Tasks

    Directory of Open Access Journals (Sweden)

    Jörg Dümmler


    Full Text Available Recent and future parallel clusters and supercomputers use symmetric multiprocessors (SMPs and multi-core processors as basic nodes, providing a huge amount of parallel resources. These systems often have hierarchically structured interconnection networks combining computing resources at different levels, starting with the interconnect within multi-core processors up to the interconnection network combining nodes of the cluster or supercomputer. The challenge for the programmer is that these computing resources should be utilized efficiently by exploiting the available degree of parallelism of the application program and by structuring the application in a way which is sensitive to the heterogeneous interconnect. In this article, we pursue a parallel programming method using parallel tasks to structure parallel implementations. A parallel task can be executed by multiple processors or cores and, for each activation of a parallel task, the actual number of executing cores can be adapted to the specific execution situation. In particular, we propose a new combined scheduling and mapping technique for parallel tasks with dependencies that takes the hierarchical structure of modern multi-core clusters into account. An experimental evaluation shows that the presented programming approach can lead to a significantly higher performance compared to standard data parallel implementations.

  18. Mapping robust parallel multigrid algorithms to scalable memory architectures (United States)

    Overman, Andrea; Vanrosendale, John


    The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.

  19. Scalable Parallel Algorithms for Formal Verification of Software Project (United States)

    National Aeronautics and Space Administration — We will develop a prototype of a GPU-based parallel Binary Decision Diagram (BDD) software package. BDDs are a data structure that satisfies some simple...

  20. Scalable Parallel Algorithms for Formal Verification of Software Project (United States)

    National Aeronautics and Space Administration — We will develop an efficient Graphics Processing Unit (GPU) based parallel Binary Decision Diagram (BDD) software package, and will also combine it with our...

  1. Scalability of Parallel Scientific Applications on the Cloud

    Directory of Open Access Journals (Sweden)

    Satish Narayana Srirama


    Full Text Available Cloud computing, with its promise of virtually infinite resources, seems to suit well in solving resource greedy scientific computing problems. To study the effects of moving parallel scientific applications onto the cloud, we deployed several benchmark applications like matrix–vector operations and NAS parallel benchmarks, and DOUG (Domain decomposition On Unstructured Grids on the cloud. DOUG is an open source software package for parallel iterative solution of very large sparse systems of linear equations. The detailed analysis of DOUG on the cloud showed that parallel applications benefit a lot and scale reasonable on the cloud. We could also observe the limitations of the cloud and its comparison with cluster in terms of performance. However, for efficiently running the scientific applications on the cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. Several iterative and embarrassingly parallel algorithms are reduced to the MapReduce model and their performance is measured and analyzed. The analysis showed that Hadoop MapReduce has significant problems with iterative methods, while it suits well for embarrassingly parallel algorithms. Scientific computing often uses iterative methods to solve large problems. Thus, for scientific computing on the cloud, this paper raises the necessity for better frameworks or optimizations for MapReduce.

  2. The development of a scalable parallel 3-D CFD algorithm for turbomachinery. M.S. Thesis Final Report (United States)

    Luke, Edward Allen


    Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.

  3. On Scalable Deep Learning and Parallelizing Gradient Descent

    CERN Document Server

    AUTHOR|(CDS)2129036; Möckel, Rico; Baranowski, Zbigniew; Canali, Luca

    Speeding up gradient based methods has been a subject of interest over the past years with many practical applications, especially with respect to Deep Learning. Despite the fact that many optimizations have been done on a hardware level, the convergence rate of very large models remains problematic. Therefore, data parallel methods next to mini-batch parallelism have been suggested to further decrease the training time of parameterized models using gradient based methods. Nevertheless, asynchronous optimization was considered too unstable for practical purposes due to a lacking understanding of the underlying mechanisms. Recently, a theoretical contribution has been made which defines asynchronous optimization in terms of (implicit) momentum due to the presence of a queuing model of gradients based on past parameterizations. This thesis mainly builds upon this work to construct a better understanding why asynchronous optimization shows proportionally more divergent behavior when the number of parallel worker...

  4. A scalable method for parallelizing sampling-based motion planning algorithms

    KAUST Repository

    Jacobs, Sam Ade


    This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.

  5. Parallel and Scalable Sparse Basic Linear Algebra Subprograms

    DEFF Research Database (Denmark)

    Liu, Weifeng

    proposes new fundamental algorithms and data structures that accelerate Sparse BLAS routines on modern massively parallel processors: (1) a new heap data structure named ad-heap, for faster heap operations on heterogeneous processors, (2) a new sparse matrix representation named CSR5, for faster sparse......Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for numerous scientific computations and graph applications. Compared with Dense BLAS, parallelization of Sparse BLAS routines entails extra challenges due to the irregularity of sparse data structures. This thesis...... matrix-vector multiplication (SpMV) on homogeneous processors such as CPUs, GPUs and Xeon Phi, (3) a new CSR-based SpMV algorithm for a variety of tightly coupled CPU-GPU heterogeneous processors, and (4) a new framework and associated algorithms for sparse matrix-matrix multiplication (SpGEMM) on GPUs...

  6. Using Python to Construct a Scalable Parallel Nonlinear Wave Solver

    KAUST Repository

    Mandli, Kyle


    Computational scientists seek to provide efficient, easy-to-use tools and frameworks that enable application scientists within a specific discipline to build and/or apply numerical models with up-to-date computing technologies that can be executed on all available computing systems. Although many tools could be useful for groups beyond a specific application, it is often difficult and time consuming to combine existing software, or to adapt it for a more general purpose. Python enables a high-level approach where a general framework can be supplemented with tools written for different fields and in different languages. This is particularly important when a large number of tools are necessary, as is the case for high performance scientific codes. This motivated our development of PetClaw, a scalable distributed-memory solver for time-dependent nonlinear wave propagation, as a case-study for how Python can be used as a highlevel framework leveraging a multitude of codes, efficient both in the reuse of code and programmer productivity. We present scaling results for computations on up to four racks of Shaheen, an IBM BlueGene/P supercomputer at King Abdullah University of Science and Technology. One particularly important issue that PetClaw has faced is the overhead associated with dynamic loading leading to catastrophic scaling. We use the walla library to solve the issue which does so by supplanting high-cost filesystem calls with MPI operations at a low enough level that developers may avoid any changes to their codes.

  7. Final Report: Super Instruction Architecture for Scalable Parallel Computations

    Energy Technology Data Exchange (ETDEWEB)

    Sanders, Beverly Ann [Univ. of Florida, Gainesville, FL (United States)


    The most advanced methods for reliable and accurate computation of the electronic structure of molecular and nano systems are the coupled-cluster techniques. These high-accuracy methods help us to understand, for example, how biological enzymes operate and contribute to the design of new organic explosives. The ACES III software provides a modern, high-performance implementation of these methods optimized for high performance parallel computer systems, ranging from small clusters typical in individual research groups, through larger clusters available in campus and regional computer centers, all the way to high-end petascale systems at national labs, including exploiting GPUs if available. This project enhanced the ACESIII software package and used it to study interesting scientific problems.

  8. Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

    Energy Technology Data Exchange (ETDEWEB)

    Daily, Jeffrey A. [Washington State Univ., Pullman, WA (United States)


    The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores

  9. Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale

    International Nuclear Information System (INIS)

    Daily, Jeffrey A.


    The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore's law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or 'homologous') on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K

  10. Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes (United States)

    Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten


    Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. These two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.

  11. Parallel scalability and efficiency of vortex particle method for aeroelasticity analysis of bluff bodies (United States)

    Tolba, Khaled Ibrahim; Morgenthal, Guido


    This paper presents an analysis of the scalability and efficiency of a simulation framework based on the vortex particle method. The code is applied for the numerical aerodynamic analysis of line-like structures. The numerical code runs on multicore CPU and GPU architectures using OpenCL framework. The focus of this paper is the analysis of the parallel efficiency and scalability of the method being applied to an engineering test case, specifically the aeroelastic response of a long-span bridge girder at the construction stage. The target is to assess the optimal configuration and the required computer architecture, such that it becomes feasible to efficiently utilise the method within the computational resources available for a regular engineering office. The simulations and the scalability analysis are performed on a regular gaming type computer.

  12. A scalable approach to modeling groundwater flow on massively parallel computers

    International Nuclear Information System (INIS)

    Ashby, S.F.; Falgout, R.D.; Tompson, A.F.B.


    We describe a fully scalable approach to the simulation of groundwater flow on a hierarchy of computing platforms, ranging from workstations to massively parallel computers. Specifically, we advocate the use of scalable conceptual models in which the subsurface model is defined independently of the computational grid on which the simulation takes place. We also describe a scalable multigrid algorithm for computing the groundwater flow velocities. We axe thus able to leverage both the engineer's time spent developing the conceptual model and the computing resources used in the numerical simulation. We have successfully employed this approach at the LLNL site, where we have run simulations ranging in size from just a few thousand spatial zones (on workstations) to more than eight million spatial zones (on the CRAY T3D)-all using the same conceptual model

  13. Task oriented evaluation system for maintenance robots

    International Nuclear Information System (INIS)

    Asame, Hajime; Endo, Isao; Kotosaka, Shin-ya; Takata, Shozo; Hiraoka, Hiroyuki; Kohda, Takehisa; Matsumoto, Akihiro; Yamagishi, Kiichiro.


    The adaptability evaluation of maintenance robots to autonomous plants has been discussed. In this paper, a new concept of autonomous plant with maintenance robots are introduced, and a framework of autonomous maintenance system is proposed. Then, task-oriented evaluation of robot arms is discussed for evaluating their adaptability to maintenance tasks, and a new criterion called operability is proposed for adaptability evaluation. The task-oriented evaluation system is implemented and applied to structural design of robot arms. Using genetic algorithm, an optimal structure adaptable to a pump disassembly task is obtained. (author)

  14. A Scalable Parallel PWTD-Accelerated SIE Solver for Analyzing Transient Scattering from Electrically Large Objects

    KAUST Repository

    Liu, Yang


    A scalable parallel plane-wave time-domain (PWTD) algorithm for efficient and accurate analysis of transient scattering from electrically large objects is presented. The algorithm produces scalable communication patterns on very large numbers of processors by leveraging two mechanisms: (i) a hierarchical parallelization strategy to evenly distribute the computation and memory loads at all levels of the PWTD tree among processors, and (ii) a novel asynchronous communication scheme to reduce the cost and memory requirement of the communications between the processors. The efficiency and accuracy of the algorithm are demonstrated through its applications to the analysis of transient scattering from a perfect electrically conducting (PEC) sphere with a diameter of 70 wavelengths and a PEC square plate with a dimension of 160 wavelengths. Furthermore, the proposed algorithm is used to analyze transient fields scattered from realistic airplane and helicopter models under high frequency excitation.

  15. Scalable and massively parallel Monte Carlo photon transport simulations for heterogeneous computing platforms. (United States)

    Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian


    We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).

  16. Development, Verification and Validation of Parallel, Scalable Volume of Fluid CFD Program for Propulsion Applications (United States)

    West, Jeff; Yang, H. Q.


    There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.

  17. Scalable Parallel Distributed Coprocessor System for Graph Searching Problems with Massive Data

    Directory of Open Access Journals (Sweden)

    Wanrong Huang


    Full Text Available The Internet applications, such as network searching, electronic commerce, and modern medical applications, produce and process massive data. Considerable data parallelism exists in computation processes of data-intensive applications. A traversal algorithm, breadth-first search (BFS, is fundamental in many graph processing applications and metrics when a graph grows in scale. A variety of scientific programming methods have been proposed for accelerating and parallelizing BFS because of the poor temporal and spatial locality caused by inherent irregular memory access patterns. However, new parallel hardware could provide better improvement for scientific methods. To address small-world graph problems, we propose a scalable and novel field-programmable gate array-based heterogeneous multicore system for scientific programming. The core is multithread for streaming processing. And the communication network InfiniBand is adopted for scalability. We design a binary search algorithm to address mapping to unify all processor addresses. Within the limits permitted by the Graph500 test bench after 1D parallel hybrid BFS algorithm testing, our 8-core and 8-thread-per-core system achieved superior performance and efficiency compared with the prior work under the same degree of parallelism. Our system is efficient not as a special acceleration unit but as a processor platform that deals with graph searching applications.

  18. A highly scalable massively parallel fast marching method for the Eikonal equation (United States)

    Yang, Jianming; Stern, Frederick


    The fast marching method is a widely used numerical method for solving the Eikonal equation arising from a variety of scientific and engineering fields. It is long deemed inherently sequential and an efficient parallel algorithm applicable to large-scale practical applications is not available in the literature. In this study, we present a highly scalable massively parallel implementation of the fast marching method using a domain decomposition approach. Central to this algorithm is a novel restarted narrow band approach that coordinates the frequency of communications and the amount of computations extra to a sequential run for achieving an unprecedented parallel performance. Within each restart, the narrow band fast marching method is executed; simple synchronous local exchanges and global reductions are adopted for communicating updated data in the overlapping regions between neighboring subdomains and getting the latest front status, respectively. The independence of front characteristics is exploited through special data structures and augmented status tags to extract the masked parallelism within the fast marching method. The efficiency, flexibility, and applicability of the parallel algorithm are demonstrated through several examples. These problems are extensively tested on six grids with up to 1 billion points using different numbers of processes ranging from 1 to 65536. Remarkable parallel speedups are achieved using tens of thousands of processes. Detailed pseudo-codes for both the sequential and parallel algorithms are provided to illustrate the simplicity of the parallel implementation and its similarity to the sequential narrow band fast marching algorithm.

  19. Learning to Model Task-Oriented Attention. (United States)

    Zou, Xiaochun; Zhao, Xinbo; Wang, Jian; Yang, Yongjia


    For many applications in graphics, design, and human computer interaction, it is essential to understand where humans look in a scene with a particular task. Models of saliency can be used to predict fixation locations, but a large body of previous saliency models focused on free-viewing task. They are based on bottom-up computation that does not consider task-oriented image semantics and often does not match actual eye movements. To address this problem, we collected eye tracking data of 11 subjects when they performed some particular search task in 1307 images and annotation data of 2,511 segmented objects with fine contours and 8 semantic attributes. Using this database as training and testing examples, we learn a model of saliency based on bottom-up image features and target position feature. Experimental results demonstrate the importance of the target information in the prediction of task-oriented visual attention.

  20. A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC

    Directory of Open Access Journals (Sweden)

    Yun-gang Xue


    Full Text Available We propose a highly parallel and scalable motion estimation algorithm, named multilevel resolution motion estimation (MLRME for short, by combining the advantages of local full search and downsampling. By subsampling a video frame, a large amount of computation is saved. While using the local full-search method, it can exploit massive parallelism and make full use of the powerful modern many-core accelerators, such as GPU and Intel Xeon Phi. We implanted the proposed MLRME into HM12.0, and the experimental results showed that the encoding quality of the MLRME method is close to that of the fast motion estimation in HEVC, which declines by less than 1.5%. We also implemented the MLRME with CUDA, which obtained 30–60x speed-up compared to the serial algorithm on single CPU. Specifically, the parallel implementation of MLRME on a GTX 460 GPU can meet the real-time coding requirement with about 25 fps for the 2560×1600 video format, while, for 832×480, the performance is more than 100 fps.

  1. Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms

    KAUST Repository

    Fidel, Adam


    Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.

  2. Performance and scalability analysis of teraflop-scale parallel architectures using multidimensional wavefront applications

    International Nuclear Information System (INIS)

    Hoisie, A.; Lubeck, O.; Wasserman, H.


    The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. They validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100-TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. In this context, they analyze two problem sizes. The model shows that on the largest such problem (1 billion cells), inter-processor communication performance is not the bottleneck. Single-node efficiency is the dominant factor

  3. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. (United States)

    Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H


    Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N(3)) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMPI implementation available as well. Scalability beyond 10,000 CPU cores for problem

  4. Task Oriented Training and Evaluation at Home. (United States)

    Rowe, Veronica T; Neville, Marsha


    Principles of experience-dependent plasticity, motor learning theory, and the theory of Occupational Adaptation coalesce into a translational model for practice in neurorehabilitation. The objective of this study was to explore the effectiveness of a Task Oriented Training and Evaluation at Home (TOTE Home) program completed by people with subacute stroke, and whether effects persisted 1 month after this training. A single-subject design included a maximum of 30, 1hour sessions of training conducted in participants' homes. Repeated target measures of accelerometry and level of confidence were used to assess movement and confidence in weaker arm use through baseline, intervention, and follow-up phases of TOTE Home. Four participants completed TOTE Home and each demonstrated improvement in movement and confidence in function. The degree of improvement varied between participants, but a detectable change was evident in outcome measures. TOTE Home, using client-centered, salient tasks not only improved motor function but also facilitated an adaptive response demonstrated in continued improvement beyond the intervention.

  5. Structure and Function of Task-Oriented Social Networks (United States)


    This is the final report on our AFOSR grant titled Structure and Function of Task-Oriented Social Networks . The goal of this project supported by the grant was to integrate social networks with other empirical data in task oriented projects, in particular open source software projects.

  6. Task-oriented training in rehabilitation after stroke : systematic review

    NARCIS (Netherlands)

    Rensink, Marijke; Schuurmans, Marieke; Lindeman, Eline; Hafsteinsdottir, Thora

    Task-oriented training in rehabilitation after stroke: systematic review. This paper is a report of a review conducted to provide an overview of the evidence in the literature on task-oriented training of stroke survivors and its relevance in daily nursing practice. Stroke is the second leading

  7. A scalable fully implicit framework for reservoir simulation on parallel computers

    KAUST Repository

    Yang, Haijian


    The modeling of multiphase fluid flow in porous medium is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, which both have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is taken into account to accelerate the convergence of the linear system, and thereby improves the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallel computers.

  8. Scalable parallel programming for high performance seismic simulation on petascale heterogeneous supercomputers (United States)

    Zhou, Jun

    The 1994 Northridge earthquake in Los Angeles, California, killed 57 people, injured over 8,700 and caused an estimated $20 billion in damage. Petascale simulations are needed in California and elsewhere to provide society with a better understanding of the rupture and wave dynamics of the largest earthquakes at shaking frequencies required to engineer safe structures. As the heterogeneous supercomputing infrastructures are becoming more common, numerical developments in earthquake system research are particularly challenged by the dependence on the accelerator elements to enable "the Big One" simulations with higher frequency and finer resolution. Reducing time to solution and power consumption are two primary focus area today for the enabling technology of fault rupture dynamics and seismic wave propagation in realistic 3D models of the crust's heterogeneous structure. This dissertation presents scalable parallel programming techniques for high performance seismic simulation running on petascale heterogeneous supercomputers. A real world earthquake simulation code, AWP-ODC, one of the most advanced earthquake codes to date, was chosen as the base code in this research, and the testbed is based on Titan at Oak Ridge National Laboraratory, the world's largest hetergeneous supercomputer. The research work is primarily related to architecture study, computation performance tuning and software system scalability. An earthquake simulation workflow has also been developed to support the efficient production sets of simulations. The highlights of the technical development are an aggressive performance optimization focusing on data locality and a notable data communication model that hides the data communication latency. This development results in the optimal computation efficiency and throughput for the 13-point stencil code on heterogeneous systems, which can be extended to general high-order stencil codes. Started from scratch, the hybrid CPU/GPU version of AWP

  9. The Computational Complexity, Parallel Scalability, and Performance of Atmospheric Data Assimilation Algorithms (United States)

    Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)


    The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have More than 10(exp 6) variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.

  10. A derivation and scalable implementation of the synchronous parallel kinetic Monte Carlo method for simulating long-time dynamics (United States)

    Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya


    Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.

  11. Accurate and Scalable O(N) Algorithm for First-Principles Molecular-Dynamics Computations on Large Parallel Computers

    Energy Technology Data Exchange (ETDEWEB)

    Osei-Kuffuor, Daniel [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Fattebert, Jean-Luc [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)


    We present the first truly scalable first-principles molecular dynamics algorithm with O(N) complexity and controllable accuracy, capable of simulating systems with finite band gaps of sizes that were previously impossible with this degree of accuracy. By avoiding global communications, we provide a practical computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic wave functions are confined, and a cutoff beyond which the components of the overlap matrix can be omitted when computing selected elements of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to 101 952 atoms on 23 328 processors, with a wall-clock time of the order of 1 min per molecular dynamics time step and numerical error on the forces of less than 7x10-4 Ha/Bohr.

  12. Fast volume reconstruction in positron emission tomography: Implementation of four algorithms on a high-performance scalable parallel platform

    International Nuclear Information System (INIS)

    Egger, M.L.; Scheurer, A.H.; Joseph, C.


    The issue of long reconstruction times in PET has been addressed from several points of view, resulting in an affordable dedicated system capable of handling routine 3D reconstruction in a few minutes per frame: on the hardware side using fast processors and a parallel architecture, and on the software side, using efficient implementations of computationally less intensive algorithms. Execution times obtained for the PRT-1 data set on a parallel system of five hybrid nodes, each combining an Alpha processor for computation and a transputer for communication, are the following (256 sinograms of 96 views by 128 radial samples): Ramp algorithm 56 s, Favor 81 s and reprojection algorithm of Kinahan and Rogers 187 s. The implementation of fast rebinning algorithms has shown our hardware platform to become communications-limited; they execute faster on a conventional single-processor Alpha workstation: single-slice rebinning 7 s, Fourier rebinning 22 s, 2D filtered backprojection 5 s. The scalability of the system has been demonstrated, and a saturation effect at network sizes above ten nodes has become visible; new T9000-based products lifting most of the constraints on network topology and link throughput are expected to result in improved parallel efficiency and scalability properties

  13. Fast ℓ1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime (United States)

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael


    We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529

  14. Fast l₁-SPIRiT compressed sensing parallel imaging MRI: scalable parallel implementation and clinically feasible runtime. (United States)

    Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael


    We present l₁-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative self-consistent parallel imaging (SPIRiT). Like many iterative magnetic resonance imaging reconstructions, l₁-SPIRiT's image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing l₁-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of l₁-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT spoiled gradient echo (SPGR) sequence with up to 8× acceleration via Poisson-disc undersampling in the two phase-encoded directions.

  15. A Scalable O(N) Algorithm for Large-Scale Parallel First-Principles Molecular Dynamics Simulations

    Energy Technology Data Exchange (ETDEWEB)

    Osei-Kuffuor, Daniel [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Fattebert, Jean-Luc [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)


    Traditional algorithms for first-principles molecular dynamics (FPMD) simulations only gain a modest capability increase from current petascale computers, due to their O(N3) complexity and their heavy use of global communications. To address this issue, we are developing a truly scalable O(N) complexity FPMD algorithm, based on density functional theory (DFT), which avoids global communications. The computational model uses a general nonorthogonal orbital formulation for the DFT energy functional, which requires knowledge of selected elements of the inverse of the associated overlap matrix. We present a scalable algorithm for approximately computing selected entries of the inverse of the overlap matrix, based on an approximate inverse technique, by inverting local blocks corresponding to principal submatrices of the global overlap matrix. The new FPMD algorithm exploits sparsity and uses nearest neighbor communication to provide a computational scheme capable of extreme scalability. Accuracy is controlled by the mesh spacing of the finite difference discretization, the size of the localization regions in which the electronic orbitals are confined, and a cutoff beyond which the entries of the overlap matrix can be omitted when computing selected entries of its inverse. We demonstrate the algorithm's excellent parallel scaling for up to O(100K) atoms on O(100K) processors, with a wall-clock time of O(1) minute per molecular dynamics time step.

  16. SCORPIO: A Scalable Two-Phase Parallel I/O Library With Application To A Large Scale Subsurface Simulator

    Energy Technology Data Exchange (ETDEWEB)

    Sreepathi, Sarat [ORNL; Sripathi, Vamsi [Intel Corporation; Mills, Richard T [ORNL; Hammond, Glenn [Pacific Northwest National Laboratory (PNNL); Mahinthakumar, Kumar [North Carolina State University


    Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting the I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O) that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O intensive applications.

  17. Task-Oriented Gaming for Transfer to Prosthesis Use

    NARCIS (Netherlands)

    van Dijk, Ludger; Sluis, van der Corry K.; van Dijk, Hylke W.; Bongers, Raoul M.


    The aim of this study is to establish the effect of task-oriented video gaming on using a myoelectric prosthesis in a basic activity of daily life (ADL). Forty-one able-bodied right-handed participants were randomly assigned to one of four groups. In three of these groups the participants trained to

  18. Parallel O(N) Stokes' solver towards scalable Brownian dynamics of hydrodynamically interacting objects in general geometries (United States)

    Zhao, Xujun; Li, Jiyuan; Jiang, Xikai; Karpeev, Dmitry; Heinonen, Olle; Smith, Barry; Hernandez-Ortiz, Juan P.; de Pablo, Juan J.


    An efficient parallel Stokes' solver has been developed for complete description of hydrodynamic interactions between Brownian particles in bulk and confined geometries. A Langevin description of the particle dynamics is adopted, where the long-range interactions are included using a Green's function formalism. A scalable parallel computational approach is presented, where the general geometry Stokeslet is calculated following a matrix-free algorithm using the general geometry Ewald-like method. Our approach employs a highly efficient iterative finite-element Stokes' solver for the accurate treatment of long-range hydrodynamic interactions in arbitrary confined geometries. A combination of mid-point time integration of the Brownian stochastic differential equation, the parallel Stokes' solver, and a Chebyshev polynomial approximation for the fluctuation-dissipation theorem leads to an O(N) parallel algorithm. We illustrate the new algorithm in the context of the dynamics of confined polymer solutions under equilibrium and non-equilibrium conditions. The method is then extended to treat suspended finite size particles of arbitrary shape in any geometry using an immersed boundary approach.

  19. Center for Programming Models for Scalable Parallel Computing - Towards Enhancing OpenMP for Manycore and Heterogeneous Nodes

    Energy Technology Data Exchange (ETDEWEB)

    Barbara Chapman


    OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.

  20. PetClaw: A scalable parallel nonlinear wave propagation solver for Python

    KAUST Repository

    Alghamdi, Amal


    We present PetClaw, a scalable distributed-memory solver for time-dependent nonlinear wave propagation. PetClaw unifies two well-known scientific computing packages, Clawpack and PETSc, using Python interfaces into both. We rely on Clawpack to provide the infrastructure and kernels for time-dependent nonlinear wave propagation. Similarly, we rely on PETSc to manage distributed data arrays and the communication between them.We describe both the implementation and performance of PetClaw as well as our challenges and accomplishments in scaling a Python-based code to tens of thousands of cores on the BlueGene/P architecture. The capabilities of PetClaw are demonstrated through application to a novel problem involving elastic waves in a heterogeneous medium. Very finely resolved simulations are used to demonstrate the suppression of shock formation in this system.

  1. Scalability of preconditioners as a strategy for parallel computation of compressible fluid flow

    Energy Technology Data Exchange (ETDEWEB)

    Hansen, G.A.


    Parallel implementations of a Newton-Krylov-Schwarz algorithm are used to solve a model problem representing low Mach number compressible fluid flow over a backward-facing step. The Mach number is specifically selected to result in a numerically {open_quote}stiff{close_quotes} matrix problem, based on an implicit finite volume discretization of the compressible 2D Navier-Stokes/energy equations using primitive variables. Newton`s method is used to linearize the discrete system, and a preconditioned Krylov projection technique is used to solve the resulting linear system. Domain decomposition enables the development of a global preconditioner via the parallel construction of contributions derived from subdomains. Formation of the global preconditioner is based upon additive and multiplicative Schwarz algorithms, with and without subdomain overlap. The degree of parallelism of this technique is further enhanced with the use of a matrix-free approximation for the Jacobian used in the Krylov technique (in this case, GMRES(k)). Of paramount interest to this study is the implementation and optimization of these techniques on parallel shared-memory hardware, namely the Cray C90 and SGI Challenge architectures. These architectures were chosen as representative and commonly available to researchers interested in the solution of problems of this type. The Newton-Krylov-Schwarz solution technique is increasingly being investigated for computational fluid dynamics (CFD) applications due to the advantages of full coupling of all variables and equations, rapid non-linear convergence, and moderate memory requirements. A parallel version of this method that scales effectively on the above architectures would be extremely attractive to practitioners, resulting in efficient, cost-effective, parallel solutions exhibiting the benefits of the solution technique.

  2. Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments (United States)


    E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su. Myrinet – a Gigabit-per-second Local-Area Network. IEEE Micro, 15(1):29–36...pages 765–770. Society for Computer Simulation, December 1989. [63] P. Reynolds and P. Dickens . SPECTRUM: A parallel simulation testbed. In The

  3. Massively Parallel and Scalable Implicit Time Integration Algorithms for Structural Dynamics (United States)

    Farhat, Charbel


    Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because of the following additional facts: (a) explicit schemes are easier to parallelize than implicit ones, and (b) explicit schemes induce short range interprocessor communications that are relatively inexpensive, while the factorization methods used in most implicit schemes induce long range interprocessor communications that often ruin the sought-after speed-up. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet be offset by the speed of the currently available parallel hardware. Therefore, it is essential to develop efficient alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating the low-frequency dynamics of aerospace structures.

  4. Scalable High-Performance Parallel Design for Network Intrusion Detection Systems on Many-Core Processors


    Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent


    Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. ...

  5. Race: A scalable and elastic parallel system for discovering repeats in very long sequences

    KAUST Repository

    Mansour, Essam


    A wide range of applications, including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. The problem of finding maximal pairs subsumes most important types of repetition-finding tasks. Existing solutions require both the input sequence and its index (typically an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences. In this paper we present RACE, a parallel system for finding maximal pairs in very long sequences. RACE supports parallel execution on stand-alone multicore systems, in addition to scaling to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that allows for cache-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider\\'s prices and the required deadline. We conducted extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a typical desktop computer with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.

  6. ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

    KAUST Repository

    Sahli, Majed


    Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow, therefore, applicable only to very small sequences (i.e., in the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems, or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To this end, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to utilize, in order to meet the user constraints in terms of run time, while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports three orders of magnitude longer sequences (e.g., DNA for the entire human genome); handles large alphabets (e.g., English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.

  7. The Parallel SBAS-DInSAR algorithm: an effective and scalable tool for Earth's surface displacement retrieval (United States)

    Zinno, Ivana; De Luca, Claudio; Elefante, Stefano; Imperatore, Pasquale; Manunta, Michele; Casu, Francesco


    Differential Synthetic Aperture Radar Interferometry (DInSAR) is an effective technique to estimate and monitor ground displacements with centimetre accuracy [1]. In the last decade, advanced DInSAR algorithms, such as the Small Baseline Subset (SBAS) [2] one that is aimed at following the temporal evolution of the ground deformation, showed to be significantly useful remote sensing tools for the geoscience communities as well as for those related to hazard monitoring and risk mitigation. DInSAR scenario is currently characterized by the large and steady increasing availability of huge SAR data archives that have a broad range of diversified features according to the characteristics of the employed sensor. Indeed, besides the old generation sensors, that include ERS, ENVISAT and RADARSAT systems, the new X-band generation constellations, such as COSMO-SkyMed and TerraSAR-X, have permitted an overall study of ground deformations with an unprecedented detail thanks to their improved spatial resolution and reduced revisit time. Furthermore, the incoming ESA Sentinel-1 SAR satellite is characterized by a global coverage acquisition strategy and 12-day revisit time and, therefore, will further contribute to improve deformation analyses and monitoring capabilities. However, in this context, the capability to process such huge SAR data archives is strongly limited by the existing DInSAR algorithms, which are not specifically designed to exploit modern high performance computational infrastructures (e.g. cluster, grid and cloud computing platforms). The goal of this paper is to present a Parallel version of the SBAS algorithm (P-SBAS) which is based on a dual-level parallelization approach and embraces combined parallel strategies [3], [4]. A detailed description of the P-SBAS algorithm will be provided together with a scalability analysis focused on studying its performances. In particular, a P-SBAS scalability analysis with respect to the number of exploited CPUs has

  8. Investigation of Ego and Task Orientation among International Wrestling Referees

    Directory of Open Access Journals (Sweden)

    I. Barbas


    Full Text Available Aim: study was to investigate any possible effect(s of experiences from active membership and participation in task or ego orientations among referees in the sport of wrestling. Material: The sample consisted of 213 international referees from 30 countries (Greece, Turkey, Bulgaria, France, Italy, Germany, Sweden, Finland, Switzerland, Russia, Poland, Hungary, U.S.A, Ukraine, Armenia, Azerbaijan, Iran, Japan, Korea, Mongolia, Kazakhstan, Egypt, Canada, Georgia, Croatia, Uzbekistan, Norway, Cuba, Belarus, & Tunisia. Their age ranged from 26 to 60 yrs. old ( M =43, SD =8.6. During the procedure, the participants were asked to fill a specific questionnaire, the «Task and Ego Orientation in Sport Questionnaire» (Duda & Nicholls, 1992. Results: Results showed that the referees from elite wrestling level’ countries (Russia, Azerbaijan, Iran, Turkey, Georgia, Armenia, Bulgaria, Ukraine, U.S.A., Korea, Japan, Kazakhstan, & Cuba are more task oriented than those from the non-elite wrestling level’ countries. Researchers believe that this occurred because referees from non-elite wrestling level’ countries might have less game-sport experience and more specifically in high level games. At the same time, the Olympic experience referees were more task oriented than the non-Olympic experienced. Conclusion: Referee’s decisions are an important issue in the sport milieu. The investigations in decision-making by referees and factors that affect it are rather scarce and research should focus on such topics. Improvement of decision-making by referees, would lead to safer and better performance. Thus, better understanding of referees’ behavior, through identification and operationalization of the factors affecting it, might lead to more effective selection, training and performance.

  9. Task-Oriented Internet Assisted English Teaching and Learning in Colleges (United States)

    Zhang, Juwu


    Task-Oriented Internet Assisted English Teaching and Learning (TIAETL) is a new English teaching and learning model which integrates the Internet-assisted and task-oriented teaching. This article analyzed the worldwide tendency of English teaching and prerequisites for TIAETL in colleges. The TIAETL has the following advantages:…

  10. Task-Oriented and Relationship-Building Communications between Air Traffic Controllers and Pilots

    Directory of Open Access Journals (Sweden)

    Inwon Kang


    Full Text Available By questioning the lopsided attention on task-oriented factors in air traffic controller-pilot communication, the current study places an equal weighting on both task-oriented and relationship-building communications, and investigates how each type of communication influences sustainable performance in airline operation team. Results show that both task-oriented and relationship-building communications in terms of sustainability of team process predicted greater communication satisfaction at work. Also, both task interdependence and shared leadership influenced both types of air traffic controller-pilot communication. However, only relationship-building communication had a direct influence on perceived work performance whereas task-oriented communication had not. Along with task-oriented factors, this study raises the relationship-oriented factors as important resources for the sustainable team performance in airline industry.

  11. A proposed scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC detectors

    International Nuclear Information System (INIS)

    Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; Vanberg, R.


    A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a proposed new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of Gigabytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the proposed Scalable Parallel Open Architecture data acquisition system are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build a prototype of the proposed data acquisition system architecture is given in the paper. The major component of the system, a self-routing parallel event builder, is described in detail

  12. Scalable, incremental learning with MapReduce parallelization for cell detection in high-resolution 3D microscopy data

    KAUST Repository

    Sung, Chul


    Accurate estimation of neuronal count and distribution is central to the understanding of the organization and layout of cortical maps in the brain, and changes in the cell population induced by brain disorders. High-throughput 3D microscopy techniques such as Knife-Edge Scanning Microscopy (KESM) are enabling whole-brain survey of neuronal distributions. Data from such techniques pose serious challenges to quantitative analysis due to the massive, growing, and sparsely labeled nature of the data. In this paper, we present a scalable, incremental learning algorithm for cell body detection that can address these issues. Our algorithm is computationally efficient (linear mapping, non-iterative) and does not require retraining (unlike gradient-based approaches) or retention of old raw data (unlike instance-based learning). We tested our algorithm on our rat brain Nissl data set, showing superior performance compared to an artificial neural network-based benchmark, and also demonstrated robust performance in a scenario where the data set is rapidly growing in size. Our algorithm is also highly parallelizable due to its incremental nature, and we demonstrated this empirically using a MapReduce-based implementation of the algorithm. We expect our scalable, incremental learning approach to be widely applicable to medical imaging domains where there is a constant flux of new data. © 2013 IEEE.

  13. Conversational interfaces for task-oriented spoken dialogues: design aspects influencing interaction quality

    NARCIS (Netherlands)

    Niculescu, A.I.


    This dissertation focuses on the design and evaluation of speech-based conversational interfaces for task-oriented dialogues. Conversational interfaces are software programs enabling interaction with computer devices through natural language dialogue. Even though processing conversational speech is

  14. A Digital Sine-Weighted switched-Gm mixer for Single-Clock Power-Scalable parallel receivers

    NARCIS (Netherlands)

    Kasri, Reda; Klumperink, Eric A.M.; Cathelin, Philippe; Tournier, Eric; Nauta, Bram


    Abstract— This paper presents a mixed A/D architecture for parallel channelized RF receiver applications. Its power consumption scales with the number of active receivers and hence with the available overall data rate. A digital sine-weighted switched-Gm mixer with a DDFS per channel is proposed as

  15. BlockSolve v1. 1: Scalable library software for the parallel solution of sparse linear systems

    Energy Technology Data Exchange (ETDEWEB)

    Jones, M.T.; Plassmann, P.E.


    BlockSolve is a software library for solving large, sparse systems of linear equations on massively parallel computers. The matrices must be symmetric, but may have an arbitrary sparsity structure. BlockSolve is a portable package that is compatible with several different message-passing pardigms. This report gives detailed instructions on the use of BlockSolve in applications programs.

  16. Numerically stable, scalable formulas for parallel and online computation of higher-order multivariate central moments with arbitrary weights

    Energy Technology Data Exchange (ETDEWEB)

    Pebay, Philippe [Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Terriberry, Timothy B. [Xiph.Org Foundation, Arlington, VA (United States); Kolla, Hemanth [Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Bennett, Janine [Sandia National Laboratories (SNL-CA), Livermore, CA (United States)


    Formulas for incremental or parallel computation of second order central moments have long been known, and recent extensions of these formulas to univariate and multivariate moments of arbitrary order have been developed. Formulas such as these, are of key importance in scenarios where incremental results are required and in parallel and distributed systems where communication costs are high. We survey these recent results, and improve them with arbitrary-order, numerically stable one-pass formulas which we further extend with weighted and compound variants. We also develop a generalized correction factor for standard two-pass algorithms that enables the maintenance of accuracy over nearly the full representable range of the input, avoiding the need for extended-precision arithmetic. We then empirically examine algorithm correctness for pairwise update formulas up to order four as well as condition number and relative error bounds for eight different central moment formulas, each up to degree six, to address the trade-offs between numerical accuracy and speed of the various algorithms. Finally, we demonstrate the use of the most elaborate among the above mentioned formulas, with the utilization of the compound moments for a practical large-scale scientific application.

  17. Portable and Transparent Message Compression in MPI Libraries to Improve the Performance and Scalability of Parallel Applications

    Energy Technology Data Exchange (ETDEWEB)

    Albonesi, David; Burtscher, Martin


    The goal of this project has been to develop a lossless compression algorithm for message-passing libraries that can accelerate HPC systems by reducing the communication time. Because both compression and decompression have to be performed in software in real time, the algorithm has to be extremely fast while still delivering a good compression ratio. During the first half of this project, they designed a new compression algorithm called FPC for scientific double-precision data, made the source code available on the web, and published two papers describing its operation, the first in the proceedings of the Data Compression Conference and the second in the IEEE Transactions on Computers. At comparable average compression ratios, this algorithm compresses and decompresses 10 to 100 times faster than BZIP2, DFCM, FSD, GZIP, and PLMI on the three architectures tested. With prediction tables that fit into the CPU's L1 data acache, FPC delivers a guaranteed throughput of six gigabits per second on a 1.6 GHz Itanium 2 system. The C source code and documentation of FPC are posted on-line and have already been downloaded hundreds of times. To evaluate FPC, they gathered 13 real-world scientific datasets from around the globe, including satellite data, crash-simulation data, and messages from HPC systems. Based on the large number of requests they received, they also made these datasets available to the community (with permission of the original sources). While FPC represents a great step forward, it soon became clear that its throughput was too slow for the emerging 10 gigabits per second networks. Hence, no speedup can be gained by including this algorithm in an MPI library. They therefore changed the aim of the second half of the project. Instead of implementing FPC in an MPI library, they refocused their efforts to develop a parallel compression algorithm to further boost the throughput. After all, all modern high-end microprocessors contain multiple CPUs on a

  18. Coupling graph perturbation theory with scalable parallel algorithms for large-scale enumeration of maximal cliques in biological graphs

    International Nuclear Information System (INIS)

    Samatova, N F; Schmidt, M C; Hendrix, W; Breimyer, P; Thomas, K; Park, B-H


    Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often considered a combinatorial graph problem where an enumeration of all feasible models is sought. The data-intensive and the NP-hard nature of such problems, however, challenges existing methods to meet the required scale of data size and uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. We designed and implemented the first (to the best of our knowledge) parallel MCE algorithm that scales linearly on thousands of processors running MCE on real-world biological networks with thousands and hundreds of thousands of vertices. In addition, we proposed and developed the Graph Perturbation Theory (GPT) that establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs and reduces the enumeration time by more than 80% compared to complete recomputation

  19. Conversational Interfaces: A Domain-Independent Architecture for Task-Oriented Dialogues (United States)


    738.4 Avoiding Mode Confusion . . . . . . . . . . . . . . . . 738.4.1 Announ ing State Changes ...involved insupporting a new task-oriented domain for dialogue. In parti ular, Iwill assume that the goal is to modify the CSLI dialogue mananger towork with

  20. [The curative effect of task-oriented approach in combination with articulation approach on spastic dysarthria]. (United States)

    Zhu, Shou-Juan; Qu, Yun; Liu, Ke


    To assess the effectiveness of task-oriented approach in treating patients with spastic dysarthria. A randomized control trial was undertaken in 44 inpatients diagnosed with spastic dysarthria at the Department of Rehabilitation Medicine in West China Hospital. All of the participants received basic medical therapy, occupational therapy, physical therapy, and an articulation approach for speech therapy by a professional speech therapist over a one month period. A task-oriented approach was added to the speech therapy regime of the test group of participants by another professional speech therapist over the same period of time. The outcomes of speech therapy were measured by the Frenchay dysarthria assessment (FDA). Significant improvements were found in the test group in relation to 15 FDA items, such as dribbling, lips spread, and palate maintenance (P0. 05) except for the three items in relation to cough, lips at rest and jaw in speech. Task-oriented approach for speech therapy is effective in treating patients with spastic dysarthria. A task-oriented approach in combination with an articulation approach can produce better patient outcomes compared with the articulation approach alone. Further studies are warranted.

  1. Effectiveness and feasibility of eccentric and task-oriented strength training in individuals with stroke

    NARCIS (Netherlands)

    Folkerts, Mireille A; Hijmans, Juha M.; Elsinghorst, Anne L.; Mulderij, Yvon; Murgia, Alessio; Dekker, Rienk


    BACKGROUND: Strength training can increase function in individuals with stroke. However it is unclear which type of strength training is most effective and feasible. OBJECTIVE: To assess the effect and feasibility of an intervention combining eccentric and task-oriented strength training in

  2. Task-Oriented Spoken Dialog System for Second-Language Learning (United States)

    Kwon, Oh-Woog; Kim, Young-Kil; Lee, Yunkeun


    This paper introduces a Dialog-Based Computer Assisted second-Language Learning (DB-CALL) system using task-oriented dialogue processing technology. The system promotes dialogue with a second-language learner for a specific task, such as purchasing tour tickets, ordering food, passing through immigration, etc. The dialog system plays a role of a…

  3. Task-oriented control of Single-Master Multi-Slave Manipulator System

    International Nuclear Information System (INIS)

    Kosuge, Kazuhiro; Ishikawa, Jun; Furuta, Katsuhisa; Hariki, Kazuo; Sakai, Masaru.


    A master-slave manipulator system, in general, consists of a master arm manipulated by a human and a slave arm used for real tasks. Some tasks, such as manipulation of a heavy object, etc., require two or more slave arms operated simultaneously. A Single-Master Multi-Slave Manipulator System consists of a master arm with six degrees of freedom and two or more slave arms, each of which has six or more degrees of freedom. In this system, a master arm controls the task-oriented variables using Virtual Internal Model (VIM) based on the concept of 'Task-Oriented Control'. VIM is a reference model driven by sensory information and used to describe the desired relation between the motion of a master arm and task-oriented variables. The motion of slave arms are controlled based on the task oriented variables generated by VIM and tailors the system to meet specific tasks. A single-master multi-slave manipulator system, having two slave arms, is experimentally developed and illustrates the concept. (author)

  4. Task oriented training improves the balance outcome & reducing fall risk in diabetic population. (United States)

    Ghazal, Javeria; Malik, Arshad Nawaz; Amjad, Imran


    The objective was to determine the balance impairments and to compare task oriented versus traditional balance training in fall reduction among diabetic patients. The randomized control trial with descriptive survey and 196 diabetic patients were recruited to assess balance impairments through purposive sampling technique. Eighteen patients were randomly allocated into two groups; task oriented balance training group TOB (n=8) and traditional balance training group TBT (n=10). The inclusion criteria were 30-50 years age bracket and diagnosed cases of Diabetes Mellitus with neuropathy. The demographics were taken through standardized & valid assessment tools include Berg Balance Scale and Functional Reach Test. The measurements were obtained at baseline, after 04 and 08 weeks of training. The mean age of the participants was 49 ±6.79. The result shows that 165(84%) were at moderate risk of fall and 31(15%) were at mild risk of fall among total 196 diabetic patients. There was significant improvement (p training group for dynamic balance, anticipatory balance and reactive balance after 8 weeks of training as compare to traditional balance training. Task oriented balance training is effective in improving the dynamic, anticipator and reactive balance. The task oriented training reduces the risk of falling through enhancing balance outcome.

  5. Effects of Task-Oriented Circuit Class Training on Walking Competency After Stroke A Systematic Review

    NARCIS (Netherlands)

    Wevers, Lotte; van de Port, Ingrid; Vermue, Mathijs; Mead, Gillian; Kwakkel, Gert

    Background and Purpose-There is increasing interest in the potential benefits of circuit class training after stroke, but its effectiveness is uncertain. Our aim was to systematically review randomized, controlled trials of task-oriented circuit class training on gait and gait-related activities in

  6. Task Oriented Hydrogeological Site Characterization: A Framework and Case Study in Predicting Contaminant Arrival Time (United States)

    Harken, B.; Schneidewind, U.; Kalbacher, T.; Rubin, Y.; Dietrich, P.


    Hydrogeological site characterization requirements vary with different modeling and risk assessment goals. For example, predicting contaminant transport requires characterization of the spatial structure and variability of hydraulic conductivity. In light of the costs associated with site characterization, a framework for task oriented field campaign design would be advantageous. The current work highlights a general framework for task oriented characterization design aimed at balancing cost with risk and uncertainty. Also presented is a case study focusing on predicting contaminant arrival time, which often depends on characterizing the spatial distribution of high-permeability zones. The case study highlights how the framework can enable the most efficient use of new characterization technologies which can improve modeling campaigns by providing measurements of hydraulic conductivity with high spatial resolution.

  7. A Network-based End-to-End Trainable Task-oriented Dialogue System


    Wen, Tsung-Hsien; Vandyke, David; Mrksic, Nikola; Gasic, Milica; Rojas-Barahona, Lina M.; Su, Pei-Hao; Ultes, Stefan; Young, Steve


    Teaching machines to accomplish tasks by conversing naturally with humans is challenging. Currently, developing task-oriented dialogue systems requires creating multiple components and typically this involves either a large amount of handcrafting, or acquiring costly labelled datasets to solve a statistical learning problem for each component. In this work we introduce a neural network-based text-in, text-out end-to-end trainable goal-oriented dialogue system along with a new way of collectin...

  8. Scalable coherent interface

    International Nuclear Information System (INIS)

    Alnaes, K.; Kristiansen, E.H.; Gustavson, D.B.; James, D.V.


    The Scalable Coherent Interface (IEEE P1596) is establishing an interface standard for very high performance multiprocessors, supporting a cache-coherent-memory model scalable to systems with up to 64K nodes. This Scalable Coherent Interface (SCI) will supply a peak bandwidth per node of 1 GigaByte/second. The SCI standard should facilitate assembly of processor, memory, I/O and bus bridge cards from multiple vendors into massively parallel systems with throughput far above what is possible today. The SCI standard encompasses two levels of interface, a physical level and a logical level. The physical level specifies electrical, mechanical and thermal characteristics of connectors and cards that meet the standard. The logical level describes the address space, data transfer protocols, cache coherence mechanisms, synchronization primitives and error recovery. In this paper we address logical level issues such as packet formats, packet transmission, transaction handshake, flow control, and cache coherence. 11 refs., 10 figs

  9. Task oriented design of robot kinematics using grid method and its application to nuclear power plant

    International Nuclear Information System (INIS)

    Chang, Pyung-Hun; Park, Joon-Young


    This paper presents a Task Oriented Design method for robot kinematics based on grid method, widely used in finite difference method and heat transfer/fluid flow analyses. This approach drastically reduces complexities and computational burden due to previous approaches. More specifically, the grid method with a new formulation simplifies the design to a problem of three-design-variable unit grid, which does not require to solve inverse/forward kinematics. The effectiveness of the grid method has been confirmed through a kinematics design of a robot for nuclear power plants. (author)

  10. Discourse pragmatics and ellipsis resolution in task-oriented natural language interfaces

    Energy Technology Data Exchange (ETDEWEB)

    Carbonell, J.G.


    The author reviews discourse phenomena that occur frequently in task-oriented man-machine dialogs, reporting on an empirical study that demonstrates the necessity of handling ellipsis, anaphora, extragrammaticality, intersentential metalanguage, and other abbreviatory devices in order to achieve convivial user interaction. Invariably, users prefer to generate terse or fragmentary utterances instead of longer, more complete stand-alone expressions, even when given clear instructions to the contrary. The xcalibur expert system interface is designed to meet these needs, including generalized ellipsis resolution by means of a rule-based caseframe method superior to previous semantic grammar approaches. 23 references.

  11. Effects of virtual reality-based training and task-oriented training on balance performance in stroke patients. (United States)

    Lee, Hyung Young; Kim, You Lim; Lee, Suk Min


    [Purpose] This study aimed to investigate the clinical effects of virtual reality-based training and task-oriented training on balance performance in stroke patients. [Subjects and Methods] The subjects were randomly allocated to 2 groups: virtual reality-based training group (n = 12) and task-oriented training group (n = 12). The patients in the virtual reality-based training group used the Nintendo Wii Fit Plus, which provided visual and auditory feedback as well as the movements that enabled shifting of weight to the right and left sides, for 30 min/day, 3 times/week for 6 weeks. The patients in the task-oriented training group practiced additional task-oriented programs for 30 min/day, 3 times/week for 6 weeks. Patients in both groups also underwent conventional physical therapy for 60 min/day, 5 times/week for 6 weeks. [Results] Balance and functional reach test outcomes were examined in both groups. The results showed that the static balance and functional reach test outcomes were significantly higher in the virtual reality-based training group than in the task-oriented training group. [Conclusion] This study suggested that virtual reality-based training might be a more feasible and suitable therapeutic intervention for dynamic balance in stroke patients compared to task-oriented training.

  12. Advert saliency distracts children's visual attention during task-oriented internet use. (United States)

    Holmberg, Nils; Sandberg, Helena; Holmqvist, Kenneth


    The general research question of the present study was to assess the impact of visually salient online adverts on children's task-oriented internet use. In order to answer this question, an experimental study was constructed in which 9- and 12-year-old Swedish children were asked to solve a number of tasks while interacting with a mockup website. In each trial, web adverts in several saliency conditions were presented. By both measuring children's task accuracy, as well as the visual processing involved in solving these tasks, this study allows us to infer how two types of visual saliency affect children's attentional behavior, and whether such behavioral effects also impacts their task performance. Analyses show that low-level visual features and task relevance in online adverts have different effects on performance measures and process measures respectively. Whereas task performance is stable with regard to several advert saliency conditions, a marked effect is seen on children's gaze behavior. On the other hand, task performance is shown to be more sensitive to individual differences such as age, gender and level of gaze control. The results provide evidence about cognitive and behavioral distraction effects in children's task-oriented internet use caused by visual saliency in online adverts. The experiment suggests that children to some extent are able to compensate for behavioral effects caused by distracting visual stimuli when solving prospective memory tasks. Suggestions are given for further research into the interdiciplinary area between media research and cognitive science.

  13. The Application of Task- oriented Teaching Approach to Enhancing Communicative Competence of EFL

    Directory of Open Access Journals (Sweden)

    LI Mingxin


    Full Text Available To communicate is the primary goal of most foreign language learning (EFL. As an important component of the four macro skills (listening, speaking, reading and writing, reading should also serve this purpose. However, traditional methodology still dominates extensive reading teaching in most of the universities. To promote a communicative extensive reading class, we may start by designing various tasks and activities. This paper introduces a task-oriented approach in English extensive reading class. According to Nunan, task-oriented teaching involves learners in the classroom to comprehend, manipulate, produce or interact in the target language, but the focus is on the meaning rather than the form. In light of psycholinguistic model and schema theory model, the methodology covers information-gap activity, opinion-gap activity and reasoning-gap activity which can be run in the class. The task-approaches make the interaction between teacher and students, students and students more active and meaningful. Skills of reading to solve communicative problems are always treated consciously. This approach may hopefully result in some improvement on the teaching of English reading.

  14. Advert saliency distracts children's visual attention during task-oriented internet use

    Directory of Open Access Journals (Sweden)

    Nils eHolmberg


    Full Text Available The general research question of the present study was to assess the impact of visually salient online adverts on children's task-oriented internet use. In order to answer this question, an experimental study was constructed in which 9-year-old and 12-year-old Swedish children were asked to solve a number of tasks while interacting with a mockup website. In each trial, web adverts in several saliency conditions were presented. By both measuring children's task accuracy, as well as the visual processing involved in solving these tasks, this study allows us to infer how two types of visual saliency affect children's attentional behavior, and whether such behavioral effects also impacts their task performance. Analyses show that low-level visual features and task relevance in online adverts have different effects on performance measures and process measures respectively. Whereas task performance is stable with regard to several advert saliency conditions, a marked effect is seen on children's gaze behavior. On the other hand, task performance is shown to be more sensitive to individual differences such as age, gender and level of gaze control. The results provide evidence about cognitive and behavioral distraction effects in children's task-oriented internet use caused by visual saliency in online adverts. The experiment suggests that children to some extent are able to compensate for behavioral effects caused by distracting visual stimuli when solving prospective memory tasks. Suggestions are given for further research into the interdiciplinary area between media research and cognitive science.

  15. Scalable devices

    KAUST Repository

    Krüger, Jens J.


    In computer science in general and in particular the field of high performance computing and supercomputing the term scalable plays an important role. It indicates that a piece of hardware, a concept, an algorithm, or an entire system scales with the size of the problem, i.e., it can not only be used in a very specific setting but it\\'s applicable for a wide range of problems. From small scenarios to possibly very large settings. In this spirit, there exist a number of fixed areas of research on scalability. There are works on scalable algorithms, scalable architectures but what are scalable devices? In the context of this chapter, we are interested in a whole range of display devices, ranging from small scale hardware such as tablet computers, pads, smart-phones etc. up to large tiled display walls. What interests us mostly is not so much the hardware setup but mostly the visualization algorithms behind these display systems that scale from your average smart phone up to the largest gigapixel display walls.

  16. Effect of a Task-Oriented Rehabilitation Program on Upper Extremity Recovery Following Motor Stroke (United States)

    Winstein, Carolee J.; Wolf, Steven L.; Dromerick, Alexander W.; Lane, Christianne J.; Nelsen, Monica A.; Lewthwaite, Rebecca; Cen, Steven Yong; Azen, Stanley P.


    IMPORTANCE Clinical trials suggest that higher doses of task-oriented training are superior to current clinical practice for patients with stroke with upper extremity motor deficits. OBJECTIVE To compare the efficacy of a structured, task-oriented motor training program vs usual and customary occupational therapy (UCC) during stroke rehabilitation. DESIGN, SETTING, AND PARTICIPANTS Phase 3, pragmatic, single-blind randomized trial among 361 participants with moderate motor impairment recruited from 7 US hospitals over 44 months, treated in the outpatient setting from June 2009 to March 2014. INTERVENTIONS Structured, task-oriented upper extremity training (Accelerated Skill Acquisition Program[ASAP]; n = 119); dose-equivalent occupational therapy (DEUCC; n = 120); or monitoring-only occupational therapy (UCC; n = 122). The DEUCC group was prescribed 30 one-hour sessions over 10 weeks; the UCC group was only monitored, without specification of dose. MAIN OUTCOMES AND MEASURES The primary outcome was 12-month change in log-transformed Wolf Motor Function Test time score (WMFT, consisting of a mean of 15 timed arm movements and hand dexterity tasks). Secondary outcomes were change in WMFT time score (minimal clinically important difference [MCID] = 19 seconds) and proportion of patients improving ≥25 points on the Stroke Impact Scale (SIS) hand function score (MCID = 17.8 points). RESULTS Among the 361 randomized patients (mean age, 60.7 years; 56% men; 42% African American; mean time since stroke onset, 46 days), 304 (84%) completed the 12-month primary outcome assessment; in intention-to-treat analysis, mean group change scores (log WMFT, baseline to 12 months) were, for the ASAP group, 2.2 to 1.4 (difference, 0.82); DEUCC group, 2.0 to 1.2 (difference, 0.84); and UCC group, 2.1 to 1.4 (difference, 0.75), with no significant between-group differences (ASAP vs DEUCC:0.14; 95% CI, −0.05 to 0.33; P = .16; ASAP vs UCC: −0.01; 95% CI, −0.22 to 0.21; P = .94; and

  17. Bilateral robotic priming before task-oriented approach in subacute stroke rehabilitation: a pilot randomized controlled trial. (United States)

    Hsieh, Yu-Wei; Wu, Ching-Yi; Wang, Wei-En; Lin, Keh-Chung; Chang, Ku-Chou; Chen, Chih-Chi; Liu, Chien-Ting


    To investigate the treatment effects of bilateral robotic priming combined with the task-oriented approach on motor impairment, disability, daily function, and quality of life in patients with subacute stroke. A randomized controlled trial. Occupational therapy clinics in medical centers. Thirty-one subacute stroke patients were recruited. Participants were randomly assigned to receive bilateral priming combined with the task-oriented approach (i.e., primed group) or to the task-oriented approach alone (i.e., unprimed group) for 90 minutes/day, 5 days/week for 4 weeks. The primed group began with the bilateral priming technique by using a bimanual robot-aided device. Motor impairments were assessed by the Fugal-Meyer Assessment, grip strength, and the Box and Block Test. Disability and daily function were measured by the modified Rankin Scale, the Functional Independence Measure, and actigraphy. Quality of life was examined by the Stroke Impact Scale. The primed and unprimed groups improved significantly on most outcomes over time. The primed group demonstrated significantly better improvement on the Stroke Impact Scale strength subscale ( p = 0.012) and a trend for greater improvement on the modified Rankin Scale ( p = 0.065) than the unprimed group. Bilateral priming combined with the task-oriented approach elicited more improvements in self-reported strength and disability degrees than the task-oriented approach by itself. Further large-scale research with at least 31 participants in each intervention group is suggested to confirm the study findings.

  18. Task orientated nursing in a tuberculosis control programme in South Africa: where does it come from and what keeps it going? (United States)

    van der Walt, Hester M; Swartz, Leslie


    Task oriented nursing is associated with traditional hospital ward organisational practice. This paper describes task orientation in a tuberculosis control programme which forms part of the public health system in Cape Town, South Africa. Task oriented practice is illustrated with clinical data from a focused ethnography on the work of nurses in a tuberculosis control programme. The origins of task orientation are traced to the colonial history of nursing in South Africa. The authors explore both the explicit and more functional reasons for maintaining task orientation, as well as the implicit and mostly unconscious socially structured defences which contribute to the continuation of this form of practice. Unless attention is given to the complexities of this phenomenon, initiatives to change task oriented practice may continue to fail.

  19. Effect of position feedback during task-oriented upper-limb training after stroke: Five-case pilot study

    NARCIS (Netherlands)

    Molier, B.I.; Prange, Grada Berendina; Krabben, T.; Stienen, Arno; van der Kooij, Herman; Buurke, Jaap; Jannink, M.J.A.; Hermens, Hermanus J.


    Feedback is an important element in motor learning during rehabilitation therapy following stroke. The objective of this pilot study was to better understand the effect of position feedback during task-oriented reach training of the upper limb in people with chronic stroke. Five subjects

  20. Early Oral Language Comprehension, Task Orientation, and Foundational Reading Skills as Predictors of Grade 3 Reading Comprehension (United States)

    Lepola, Janne; Lynch, Julie; Kiuru, Noona; Laakkonen, Eero; Niemi, Pekka


    The present five-year longitudinal study from preschool to grade 3 examined the developmental associations among oral language comprehension, task orientation, reading precursors, and reading fluency, as well as their role in predicting grade 3 reading comprehension. Ninety Finnish-speaking students participated in the study. The students' oral…

  1. A Task-oriented Study on the Influencing Effects of Query-biased Summarization in Web Searching. (United States)

    White, Ryen W.; Jose, Joemon M.; Ruthven, Ian


    A task-oriented, comparative evaluation between four Web retrieval systems was performed; two using query-biased summarization, and two using the standard ranked titles/abstracts approach. Results indicate that query-biased summarization techniques appear to be more useful and effective in helping users gauge document relevance than the…

  2. Temporal motifs reveal collaboration patterns in online task-oriented networks (United States)

    Xuan, Qi; Fang, Huiting; Fu, Chenbo; Filkov, Vladimir


    Real networks feature layers of interactions and complexity. In them, different types of nodes can interact with each other via a variety of events. Examples of this complexity are task-oriented social networks (TOSNs), where teams of people share tasks towards creating a quality artifact, such as academic research papers or software development in commercial or open source environments. Accomplishing those tasks involves both work, e.g., writing the papers or code, and communication, to discuss and coordinate. Taking into account the different types of activities and how they alternate over time can result in much more precise understanding of the TOSNs behaviors and outcomes. That calls for modeling techniques that can accommodate both node and link heterogeneity as well as temporal change. In this paper, we report on methodology for finding temporal motifs in TOSNs, limited to a system of two people and an artifact. We apply the methods to publicly available data of TOSNs from 31 Open Source Software projects. We find that these temporal motifs are enriched in the observed data. When applied to software development outcome, temporal motifs reveal a distinct dependency between collaboration and communication in the code writing process. Moreover, we show that models based on temporal motifs can be used to more precisely relate both individual developer centrality and team cohesion to programmer productivity than models based on aggregated TOSNs.

  3. An evaluation of five bedside information products using a user-centered, task-oriented approach. (United States)

    Campbell, Rose; Ash, Joan


    The paper compares several bedside information tools using user-centered, task-oriented measures to assist those making or supporting purchasing decisions. Eighteen potential users were asked to attempt to answer clinical questions using five commercial products (ACP's PIER, DISEASEDEX, FIRSTConsult, InfoRetriever, and UpToDate). Users evaluated each tool for ease-of-use and user satisfaction. The average number of questions answered and user satisfaction were measured for each product. Results show no significant differences in user perceptions of content quality. However, user interaction measures (such as screen layout) show a significant preference for the UpToDate product. In addition, users found answers to significantly more questions using UpToDate. When evaluating electronic products designed for use at the point of care, the user interaction aspects of a product become as important as more traditional content-based measures of quality. Actual or potential users of such products are appropriately equipped to identify which products rate the highest on these measures.

  4. Scalable shared-memory multiprocessing

    CERN Document Server

    Lenoski, Daniel E


    Dr. Lenoski and Dr. Weber have experience with leading-edge research and practical issues involved in implementing large-scale parallel systems. They were key contributors to the architecture and design of the DASH multiprocessor. Currently, they are involved with commercializing scalable shared-memory technology.

  5. A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC [Superconducting Super Collider] detectors

    International Nuclear Information System (INIS)

    Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; VanBerg, R.


    A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequences, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of GigaBytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab

  6. Efficacy of Occupational Therapy Task-oriented Approach in Upper Extremity Post-stroke Rehabilitation. (United States)

    Almhdawi, Khader A; Mathiowetz, Virgil G; White, Matthew; delMas, Robert C


    There is a need for more effective rehabilitation methods for individuals post-stroke. Occupational Therapy Task-Oriented (TO) approach has not been evaluated in a randomized clinical trial. The purpose of this study was to evaluate functional and impairment efficacies of TO approach on the more-affected Upper Extremity (UE) of persons post-stroke. A randomized single-blinded cross-over trial recruited 20 participants post-stroke (mean chronicity = 62 months) who demonstrated at least 10° active more-affected shoulder flexion and abduction and elbow flexion-extension. Participants were randomized into immediate (n = 10) and delayed intervention (n = 10) groups. Immediate group had 6 weeks of 3 hr/week TO intervention followed by 6 weeks of no-intervention control. Delayed intervention group underwent the reversed order. Functional measures included Canadian Occupational Performance Measure (COPM), Motor Activity Log (MAL), and Wolf Motor Function Test (WMFT). Impairment measures included UE Active Range of Motion (AROM) and handheld dynamometry strength. Measurements were obtained at baseline, cross over, and end of the study. TO intervention showed statistically higher functional change scores. COPM performance and satisfaction scores were 2.83 and 3.46 units greater respectively (p post-stroke rehabilitation approach inducing clinically meaningful functional improvements. More studies are needed with larger samples and specific stroke chronicity and severity. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.

  7. Parallel rendering (United States)

    Crockett, Thomas W.


    This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.

  8. Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505

    Energy Technology Data Exchange (ETDEWEB)

    Robert W. Numrich


    The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a programto provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to

  9. Effects of activity repetition training with Salat (prayer) versus task oriented training on functional outcomes of stroke. (United States)

    Ghous, Misbah; Malik, Arshad Nawaz; Amjad, Mian Imran; Kanwal, Maria


    Stroke is one of most disabling condition which directly affects quality of life. The objective of this study was to compare the effect of activity repetition training with salat (prayer) versus task oriented training on functional outcomes of stroke. The study design was randomized control trial and 32 patients were randomly assigned into two groups'. The stroke including infarction or haemorrhagic, age bracket 30-70 years was included. The demographics were recorded and standardized assessment tool included Berg Balance Scale (BBS), Motor assessment scale (MAS) and Time Up and Go Test (TUG). The measurements were obtained at baseline, after four and six weeks. The mean age of the patients was 54.44±10.59 years with 16 (59%) male and 11(41%) female patients. Activity Repetition Training group showed significant improvement (pfunctional status as compare to task oriented training group. The repetition with motivation and concentration is the key in re-learning process of neural plasticity.

  10. Task-Oriented and Bottle Feeding Adversely Affect the Quality of Mother-Infant Interactions Following Abnormal Newborn Screens (United States)

    Tluczek, Audrey; Clark, Roseanne; McKechnie, Anne Chevalier; Orland, Kate Murphy; Brown, Roger L.


    Objective Examine effects of newborn screening (NBS) and neonatal diagnosis on the quality of mother-infant interactions in the context of feeding. Methods Study compared the quality of mother-infant feeding interactions among four groups of infants classified by severity of NBS and diagnostic results: cystic fibrosis (CF), congenital hypothyroidism, heterozygote CF carrier, and healthy with normal NBS. The Parent-Child Early Relational Assessment and a task-oriented item measured the quality of feeding interactions for 130 dyads, infant ages 3–19 weeks (M=9.19, SD=3.28). The Center for Epidemiologic Studies Depression Scale and State-Trait Anxiety Inventory measured maternal depression and anxiety. Results Composite Indicator Structure Equation Modeling showed that infant diagnostic status and, to a lesser extent, maternal education predicted feeding method. Mothers of infants with CF were most likely to bottle feed, which was associated with more task-oriented maternal behavior than breastfeeding. Mothers with low task-oriented behavior showed more sensitivity and responsiveness to infant cues, as well as less negative affect and behavior in their interactions with their infants than mothers with high task-oriented scores. Mothers of infants with CF were significantly more likely to have clinically significant anxiety and depression than the other groups. However, maternal psychological profile did not predict feeding method or interaction quality. Conclusions Mothers in the CF group were the least likely to breastfeed. Research is needed to explicate long-term effects of feeding methods on quality of mother-child relationship and ways to promote continued breastfeeding following a neonatal CF diagnosis. PMID:20495477

  11. An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications. (United States)

    Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun


    Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition and action recognition because of its best-in-class recognition accuracy compared to hand-crafted algorithm and shallow learning based algorithms. Long learning time caused by its complex structure, however, limits its usage only in high-cost servers or many-core GPU platforms so far. On the other hand, the demand on customized pattern recognition within personal devices will grow gradually as more deep learning applications will be developed. This paper presents a SoC implementation to enable deep learning applications to run with low cost platforms such as mobile or portable devices. Different from conventional works which have adopted massively-parallel architecture, this work adopts task-flexible architecture and exploits multiple parallelism to cover complex functions of convolutional deep belief network which is one of popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable system. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.

  12. Effects of Task-oriented Approach on Affected Arm Function in Children with Spastic Hemiplegia Due to Cerebral Palsy. (United States)

    Song, Chiang-Soon


    [Purpose] The purpose of the present study was to examine the effects of task-oriented approach on motor function of the affected arm in children with spastic hemiplegia due to cerebral palsy. [Subjects] Twelve children were recruited by convenience sampling from 2 local rehabilitation centers. The present study utilized a one-group pretest-posttest design. All of children received task-oriented training for 6 weeks (40 min/day, 5 days/week) and also underwent regular occupational therapy. Three clinical tests, Box and Block Test (BBT), Manual Ability Measure (MAM-16), and Wee Functional Independence Measure (WeeFIM) were performed 1 day before and after training to evaluate the effects of the training. [Results] Compared with the pretest scores, there was a significant increase in the BBT, MAM-16, and WeeFIM scores of the children after the 6-week practice period. [Conclusion] The results of this study suggest that a task-oriented approach to treatment of the affected arm improves functional activities, such as manual dexterity and fine motor performance, as well as basic daily activities of patients with spastic hemiplegia due to cerebral palsy.

  13. Scalable algorithms for contact problems

    CERN Document Server

    Dostál, Zdeněk; Sadowská, Marie; Vondrák, Vít


    This book presents a comprehensive and self-contained treatment of the authors’ newly developed scalable algorithms for the solutions of multibody contact problems of linear elasticity. The brand new feature of these algorithms is theoretically supported numerical scalability and parallel scalability demonstrated on problems discretized by billions of degrees of freedom. The theory supports solving multibody frictionless contact problems, contact problems with possibly orthotropic Tresca’s friction, and transient contact problems. It covers BEM discretization, jumping coefficients, floating bodies, mortar non-penetration conditions, etc. The exposition is divided into four parts, the first of which reviews appropriate facets of linear algebra, optimization, and analysis. The most important algorithms and optimality results are presented in the third part of the volume. The presentation is complete, including continuous formulation, discretization, decomposition, optimality results, and numerical experimen...

  14. Scalable Domain Decomposed Monte Carlo Particle Transport

    Energy Technology Data Exchange (ETDEWEB)

    O' Brien, Matthew Joseph [Univ. of California, Davis, CA (United States)


    In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation.

  15. Scalable Nonlinear Compact Schemes

    Energy Technology Data Exchange (ETDEWEB)

    Ghosh, Debojyoti [Argonne National Lab. (ANL), Argonne, IL (United States); Constantinescu, Emil M. [Univ. of Chicago, IL (United States); Brown, Jed [Univ. of Colorado, Boulder, CO (United States)


    In this work, we focus on compact schemes resulting in tridiagonal systems of equations, specifically the fifth-order CRWENO scheme. We propose a scalable implementation of the nonlinear compact schemes by implementing a parallel tridiagonal solver based on the partitioning/substructuring approach. We use an iterative solver for the reduced system of equations; however, we solve this system to machine zero accuracy to ensure that no parallelization errors are introduced. It is possible to achieve machine-zero convergence with few iterations because of the diagonal dominance of the system. The number of iterations is specified a priori instead of a norm-based exit criterion, and collective communications are avoided. The overall algorithm thus involves only point-to-point communication between neighboring processors. Our implementation of the tridiagonal solver differs from and avoids the drawbacks of past efforts in the following ways: it introduces no parallelization-related approximations (multiprocessor solutions are exactly identical to uniprocessor ones), it involves minimal communication, the mathematical complexity is similar to that of the Thomas algorithm on a single processor, and it does not require any communication and computation scheduling.

  16. Measures and mechanisms of common ground: backchannels, conversational repair, and interactive alignmentin free and task-oriented social interactions

    DEFF Research Database (Denmark)

    Fusaroli, Riccardo; Tylén, Kristian; Madsen, Katrine Garly

    A crucial aspect of everyday conversational interactions is our ability to establish and maintain common ground. Understanding the relevant mechanisms involved in such social coordination remains an important challenge for cognitive science. While common ground is often discussed in very general...... terms, different contexts of interaction are likely to afford different coordination mechanisms. In this paper, we investigate the presence and relation of three mechanisms of social coordination–backchannels, interactive alignment and conversational repair –across free and task-oriented conversations...... with backchannels. Our findings highlight the need to explicitly assess several mechanisms at once and to investigate diverse social activities to understand their role and relations....

  17. Bidirectional scalable motion for scalable video coding. (United States)

    Chen, Hu; Kao, Meng-Ping; Nguyen, Truong Q


    Motion information scalability is an important requirement for a fully scalable video codec, especially in low bit rate or small resolution decoding scenarios, for which the fully scalable motion model (SMM) has been proposed. SMM can collaborate flawlessly with other scalabilities, such as spatial, temporal and quality, in a scalable video codec. It performs better than the nonscalable motion model. To further improve the SMM, this paper extends the algorithm to support the hierarchical B frame structure and bidirectional or multidirectional motion estimation. Furthermore, the corresponding rate distortion optimized estimation for improved efficiency in several scenarios is discussed. Several simulation results based on the updated framework are presented to verify the advantage of this extension.

  18. Predictive power of task orientation, general self-efficacy and self-determined motivation on fun and boredom

    Directory of Open Access Journals (Sweden)

    Lorena Ruiz-González


    Full Text Available Abstract The aim of this study was to test the predictive power of dispositional orientations, general self-efficacy and self-determined motivation on fun and boredom in physical education classes, with a sample of 459 adolescents between 13 and 18 with a mean age of 15 years (SD = 0.88. The adolescents responded to four Likert scales: Perceptions of Success Questionnaire, General Self-Efficacy Scale, Sport Motivation Scale and Intrinsic Satisfaction Questionnaire in Sport. The results showed the structural regression model showed that task orientation and general self-efficacy positively predicted self-determined motivation and this in turn positively predicted more fun and less boredom in physical education classes. Consequently, the promotion of an educational task-oriented environment where learners perceive their progress and make them feel more competent, will allow them to overcome the intrinsically motivated tasks, and therefore they will have more fun. Pedagogical implications for less boredom and more fun in physical education classes are discussed.

  19. An Agent-Based Simulation for Investigating the Impact of Stereotypes on Task-Oriented Group Formation (United States)

    Maghami, Mahsa; Sukthankar, Gita

    In this paper, we introduce an agent-based simulation for investigating the impact of social factors on the formation and evolution of task-oriented groups. Task-oriented groups are created explicitly to perform a task, and all members derive benefits from task completion. However, even in cases when all group members act in a way that is locally optimal for task completion, social forces that have mild effects on choice of associates can have a measurable impact on task completion performance. In this paper, we show how our simulation can be used to model the impact of stereotypes on group formation. In our simulation, stereotypes are based on observable features, learned from prior experience, and only affect an agent's link formation preferences. Even without assuming stereotypes affect the agents' willingness or ability to complete tasks, the long-term modifications that stereotypes have on the agents' social network impair the agents' ability to form groups with sufficient diversity of skills, as compared to agents who form links randomly. An interesting finding is that this effect holds even in cases where stereotype preference and skill existence are completely uncorrelated.

  20. Joint Experimentation on Scalable Parallel Processors (JESPP) (United States)


    on the order of 50 Mb/s to the JESPP. Our personnel assisted J9 in designing, selecting, installing and initializing a “ Beowulf ” Linux cluster at J9...CPUs that are connected by a high speed network. One of the major types of systems run by the HPC centers is the Beowulf cluster. (Sterling et al...1995) Beowulf clusters have become popular systems in the world of HPC systems because of their low cost for the amount of power they provide. A

  1. Joint Experimentation on Scalable Parallel Processors (JESPP)

    National Research Council Canada - National Science Library

    Davis, Dan M; Lucas, Robert F; Yao, Ka-Thia; Wagenbrath, Gene


    ...) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity, terrain resolution, infrastructure features, environmental realism, and analytical potential...

  2. Scalability of human models

    NARCIS (Netherlands)

    Rodarius, C.; Rooij, L. van; Lange, R. de


    The objective of this work was to create a scalable human occupant model that allows adaptation of human models with respect to size, weight and several mechanical parameters. Therefore, for the first time two scalable facet human models were developed in MADYMO. First, a scalable human male was

  3. The effectiveness of task-oriented intervention and trunk restraint on upper limb movement quality in children with cerebral palsy. (United States)

    Schneiberg, Sheila; McKinley, Patricia A; Sveistrup, Heidi; Gisel, Erika; Mayo, Nancy E; Levin, Mindy F


    The goal of this study was to contribute evidence towards the effectiveness of task-oriented training with and without restriction of trunk movement (trunk restraint) on the quality of upper limb movement in children with cerebral palsy (CP). We used a prospective, single-subject research design in 12 children (three males, nine females; aged 6-11 y; median 9 y) with di-, hemi-, or quadriplegia. Movements of the most affected arm were assessed five times: three times before training, immediately after training, and 3 months after training. The main outcome measures were the Melbourne Assessment of Unilateral Upper Limb Function (Melbourne) and upper limb movement kinematics during a functional reaching task. Children were randomly allocated to one of two groups: task-oriented training with or without trunk restraint. Treatment consisted of three 1-hour sessions per week for 5 weeks (total training duration 15 h). Treatment effects were determined using single-subject research design analysis--regression through baseline data and standard mean differences. Although the Melbourne scores were largely unchanged after training, some children in each group improved arm trajectory smoothness (effect size 0.55-1.87), and most children improved elbow extension range (effect size 0.55-4.79). However, more children in the trunk restraint group than in the no restraint group demonstrated reduced trunk displacement (effect size 0.94-2.25) and longer-term improvements in elbow extension and trunk use. Among the group who underwent training without trunk restraint, trunk displacement was unchanged or increased, and fewer carry-over effects were apparent at follow-up. This proof-of-principle study showed that greater improvement in the quality of upper limb movement in children with CP, including less compensatory trunk use and better carry-over effects, was achieved by training with trunk restraint. © The Authors. Journal compilation © Mac Keith Press 2010.

  4. Scalable rendering on PC clusters

    Energy Technology Data Exchange (ETDEWEB)



    This case study presents initial results from research targeted at the development of cost-effective scalable visualization and rendering technologies. The implementations of two 3D graphics libraries based on the popular sort-last and sort-middle parallel rendering techniques are discussed. An important goal of these implementations is to provide scalable rendering capability for extremely large datasets (>> 5 million polygons). Applications can use these libraries for either run-time visualization, by linking to an existing parallel simulation, or for traditional post-processing by linking to an interactive display program. The use of parallel, hardware-accelerated rendering on commodity hardware is leveraged to achieve high performance. Current performance results show that, using current hardware (a small 16-node cluster), they can utilize up to 85% of the aggregate graphics performance and achieve rendering rates in excess of 20 million polygons/second using OpenGL{reg_sign} with lighting, Gouraud shading, and individually specified triangles (not t-stripped).

  5. Highly Scalable Eigensolvers for Petaflop Applications


    Auckenthaler, Thomas


    This thesis presents the development of a new eigensolver for the use in massively parallel systems. Current implementations lack in both, parallel and sequential efficiency on modern computer architectures and are becoming the bottleneck of many scientific applications, e.g. in quantum chemistry. Efficiency and scalability of the new eigensolver are unprecedented and result in an up to 10-fold improvement compared to current state-of-the-art libraries. Diese Arbeit beschreibt die Entwick...

  6. Joint Experiment on Scalable Parallel Processors (JESPP) Parallel Data Management (United States)


    thousand entities to a WAN including multiple Beowulf clusters and hundreds of processors simulating hundreds of thousands of entities. larger simulations on Beowulf clusters ISI implemented a distributed logger. Data is logged locally on each processor running a simulator...development and execution effort (Lucas, 2003). Common SPPs include the IBM SP, SGI Origin, Cray T3E, and the “ Beowulf ” Linux clusters. Traditionally

  7. Core stability exercise is as effective as task-oriented motor training in improving motor proficiency in children with developmental coordination disorder: a randomized controlled pilot study. (United States)

    Au, Mei K; Chan, Wai M; Lee, Lin; Chen, Tracy Mk; Chau, Rosanna Mw; Pang, Marco Yc


    To compare the effectiveness of a core stability program with a task-oriented motor training program in improving motor proficiency in children with developmental coordination disorder (DCD). Randomized controlled pilot trial. Outpatient unit in a hospital. Twenty-two children diagnosed with DCD aged 6-9 years were randomly allocated to the core stability program or the task-oriented motor program. Both groups underwent their respective face-to-face training session once per week for eight consecutive weeks. They were also instructed to carry out home exercises on a daily basis during the intervention period. Short Form of the Bruininks-Oseretsky Test of Motor Proficiency (Second Edition) and Sensory Organization Test at pre- and post-intervention. Intention-to-treat analysis revealed no significant between-group difference in the change of motor proficiency standard score (P=0.717), and composite equilibrium score derived from the Sensory Organization Test (P=0.100). Further analysis showed significant improvement in motor proficiency in both the core stability (mean change (SD)=6.3(5.4); p=0.008) and task-oriented training groups (mean change(SD)=5.1(4.0); P=0.007). The composite equilibrium score was significantly increased in the task-oriented training group (mean change (SD)=6.0(5.5); P=0.009), but not in the core stability group (mean change(SD) =0.0(9.6); P=0.812). In the task-oriented training group, compliance with the home program was positively correlated with change in motor proficiency (ρ=0.680, P=0.030) and composite equilibrium score (ρ=0.638, P=0.047). The core stability exercise program is as effective as task-oriented training in improving motor proficiency among children with DCD. © The Author(s) 2014.

  8. Scalable Frequent Subgraph Mining

    KAUST Repository

    Abdelhamid, Ehab


    A graph is a data structure that contains a set of nodes and a set of edges connecting these nodes. Nodes represent objects while edges model relationships among these objects. Graphs are used in various domains due to their ability to model complex relations among several objects. Given an input graph, the Frequent Subgraph Mining (FSM) task finds all subgraphs with frequencies exceeding a given threshold. FSM is crucial for graph analysis, and it is an essential building block in a variety of applications, such as graph clustering and indexing. FSM is computationally expensive, and its existing solutions are extremely slow. Consequently, these solutions are incapable of mining modern large graphs. This slowness is caused by the underlying approaches of these solutions which require finding and storing an excessive amount of subgraph matches. This dissertation proposes a scalable solution for FSM that avoids the limitations of previous work. This solution is composed of four components. The first component is a single-threaded technique which, for each candidate subgraph, needs to find only a minimal number of matches. The second component is a scalable parallel FSM technique that utilizes a novel two-phase approach. The first phase quickly builds an approximate search space, which is then used by the second phase to optimize and balance the workload of the FSM task. The third component focuses on accelerating frequency evaluation, which is a critical step in FSM. To do so, a machine learning model is employed to predict the type of each graph node, and accordingly, an optimized method is selected to evaluate that node. The fourth component focuses on mining dynamic graphs, such as social networks. To this end, an incremental index is maintained during the dynamic updates. Only this index is processed and updated for the majority of graph updates. Consequently, search space is significantly pruned and efficiency is improved. The empirical evaluation shows that the

  9. Parallel Processing and Scientific Applications (United States)


    the algorithm have computational complexity O(N) and be scalable. Multigrid Methods Among the known scalable for elliptic PDE solution, the...close to linear in N and scales linearly in P. However multigrid methods have a particular difficulty on parallel machines in that they do not have...This inherent disadvantage of multigrid methods has led to the search for truly parallel MG methods - multigrid methods that can utilize all processors

  10. Effect of task-oriented training and high-variability practice on gross motor performance and activities of daily living in children with spastic diplegia. (United States)

    Kwon, Hae-Yeon; Ahn, So-Yoon


    [Purpose] This study investigates how a task-oriented training and high-variability practice program can affect the gross motor performance and activities of daily living for children with spastic diplegia and provides an effective and reliable clinical database for future improvement of motor performances skills. [Subjects and Methods] This study randomly assigned seven children with spastic diplegia to each intervention group including that of a control group, task-oriented training group, and a high-variability practice group. The control group only received neurodevelopmental treatment for 40 minutes, while the other two intervention groups additionally implemented a task-oriented training and high-variability practice program for 8 weeks (twice a week, 60 min per session). To compare intra and inter-relationships of the three intervention groups, this study measured gross motor performance measure (GMPM) and functional independence measure for children (WeeFIM) before and after 8 weeks of training. [Results] There were statistically significant differences in the amount of change before and after the training among the three intervention groups for the gross motor performance measure and functional independence measure. [Conclusion] Applying high-variability practice in a task-oriented training course may be considered an efficient intervention method to improve motor performance skills that can tune to movement necessary for daily livelihood through motor experience and learning of new skills as well as change of tasks learned in a complex environment or similar situations to high-variability practice.

  11. Behavior Modification in the Classroom. Final Report. Production and Evaluation of a Film about Behavior Techniques to Increase Task Oriented Behavior. (United States)

    Stanford Univ., CA. Dept. of Communication.

    A 21 minute sound color film was produced in an attempt to help teachers and educational specialists see people like themselves helping youngsters become more task oriented by the use of operant conditioning and modeling procedures formulated by Skinner (1953) and Bandura (1964). The unstructured and unrehearsed film consists of edited film…

  12. An evaluation of five bedside information products using a user-centered, task-oriented approach*†‡ (United States)

    Campbell, Rose; Ash, Joan


    Purpose: The paper compares several bedside information tools using user-centered, task-oriented measures to assist those making or supporting purchasing decisions. Methods: Eighteen potential users were asked to attempt to answer clinical questions using five commercial products (ACP's PIER, DISEASEDEX, FIRSTConsult, InfoRetriever, and UpToDate). Users evaluated each tool for ease-of-use and user satisfaction. The average number of questions answered and user satisfaction were measured for each product. Results: Results show no significant differences in user perceptions of content quality. However, user interaction measures (such as screen layout) show a significant preference for the UpToDate product. In addition, users found answers to significantly more questions using UpToDate. Conclusion: When evaluating electronic products designed for use at the point of care, the user interaction aspects of a product become as important as more traditional content-based measures of quality. Actual or potential users of such products are appropriately equipped to identify which products rate the highest on these measures. PMID:17082836

  13. Scalable, distributed data mining using an agent based architecture

    Energy Technology Data Exchange (ETDEWEB)

    Kargupta, H.; Hamzaoglu, I.; Stafford, B.


    Algorithm scalability and the distributed nature of both data and computation deserve serious attention in the context of data mining. This paper presents PADMA (PArallel Data Mining Agents), a parallel agent based system, that makes an effort to address these issues. PADMA contains modules for (1) parallel data accessing operations, (2) parallel hierarchical clustering, and (3) web-based data visualization. This paper describes the general architecture of PADMA and experimental results.

  14. A scalable distributed RRT for motion planning

    KAUST Repository

    Jacobs, Sam Ade


    Rapidly-exploring Random Tree (RRT), like other sampling-based motion planning methods, has been very successful in solving motion planning problems. Even so, sampling-based planners cannot solve all problems of interest efficiently, so attention is increasingly turning to parallelizing them. However, one challenge in parallelizing RRT is the global computation and communication overhead of nearest neighbor search, a key operation in RRTs. This is a critical issue as it limits the scalability of previous algorithms. We present two parallel algorithms to address this problem. The first algorithm extends existing work by introducing a parameter that adjusts how much local computation is done before a global update. The second algorithm radially subdivides the configuration space into regions, constructs a portion of the tree in each region in parallel, and connects the subtrees,i removing cycles if they exist. By subdividing the space, we increase computation locality enabling a scalable result. We show that our approaches are scalable. We present results demonstrating almost linear scaling to hundreds of processors on a Linux cluster and a Cray XE6 machine. © 2013 IEEE.

  15. Foreign language learning using e-mail in a task-oriented perspective: Interuniversity experiments in communication and collaboration (United States)

    Barson, John; Frommer, Judith; Schwartz, Michael


    From 1988 to 1990 several collaborative “cross-country” intermediate French classes at Harvard and Stanford became one class. Students combined their efforts and insights in the accomplishment of a semester-long task, in most cases the publication of a student newspaper or magazine, using the electronic mail (e-mail) network to contact each other, elaborate their plans, and bring their projects to successful conclusion. Additional experiments of a similar nature took place between Harvard and the University of Pittsburgh (in the spring of 1990) and between Stanford and the University of Pittsburgh during 1991 1993. This paper suggests that this type of task-oriented learning through distance-communication is applicable at many different course levels and has considerable merit as an approach to teaching and learning. The key phases of this task-based model are presented along with technological information regarding computers and networks, as a guide to colleagues interested in pursuing similar lines of experimental teaching. Also included are samples of student messages, with their varied and often highly colorful discourse features, which attest to the motivation of students and reveal the strong personal investment made by the participants as they join hands across the miles in a productive, communication-based enterprise. The language and learning styles generated by technology and computers fully deserve closer investigation by researchers and teaching practitioners alike. The authors summarize the experiments, discuss assessment, and present research issues, concluding that good pedagogy and quality technology must share a vision of what can be accomplished in this rapidly evolving educational work place.

  16. A fully scalable motion model for scalable video coding. (United States)

    Kao, Meng-Ping; Nguyen, Truong


    Motion information scalability is an important requirement for a fully scalable video codec, especially for decoding scenarios of low bit rate or small image size. So far, several scalable coding techniques on motion information have been proposed, including progressive motion vector precision coding and motion vector field layered coding. However, it is still vague on the required functionalities of motion scalability and how it collaborates flawlessly with other scalabilities, such as spatial, temporal, and quality, in a scalable video codec. In this paper, we first define the functionalities required for motion scalability. Based on these requirements, a fully scalable motion model is proposed along with tailored encoding techniques to minimize the coding overhead of scalability. Moreover, the associated rate distortion optimized motion estimation algorithm will be provided to achieve better efficiency throughout various decoding scenarios. Simulation results will be presented to verify the superiorities of proposed scalable motion model over nonscalable ones.

  17. Scalable Atomistic Simulation Algorithms for Materials Research

    Directory of Open Access Journals (Sweden)

    Aiichiro Nakano


    Full Text Available A suite of scalable atomistic simulation programs has been developed for materials research based on space-time multiresolution algorithms. Design and analysis of parallel algorithms are presented for molecular dynamics (MD simulations and quantum-mechanical (QM calculations based on the density functional theory. Performance tests have been carried out on 1,088-processor Cray T3E and 1,280-processor IBM SP3 computers. The linear-scaling algorithms have enabled 6.44-billion-atom MD and 111,000-atom QM calculations on 1,024 SP3 processors with parallel efficiency well over 90%. production-quality programs also feature wavelet-based computational-space decomposition for adaptive load balancing, spacefilling-curve-based adaptive data compression with user-defined error bound for scalable I/O, and octree-based fast visibility culling for immersive and interactive visualization of massive simulation data.

  18. Task-oriented robot-assisted stroke therapy of paretic limb improves control in a unilateral and bilateral functional drink task: a case study. (United States)

    Christopher, Seethu M; Johnson, Michelle J


    The purpose of this paper is to evaluate the functional, temporal and spatial effect of a unilateral task-oriented, robot-assisted training on unilateral and bilateral task performance of a drinking task using a real object. Two chronic stroke survivors experienced task-oriented robot assisted therapy, in which the paretic arm was trained using reaching and grasping tasks over 4 weeks. Both subjects experienced improvement in motor control as measured by Fugl-Meyer. The paretic arm was evaluated using movement smoothness (MS) and time to completion (TCT) measures before and after therapy. From the results, we found that the unilateral robotassisted training improved paretic arm control in the unilateral and the bilateral drink task. However, the influence of the non-paretic movement on the temporal and spatial paretic arm control was evident both pre and post therapy suggesting inter-limb coupling aids in the transfer of unilateral improvements in motor control to improvements in bilateral motor control.

  19. The effect of progressive task-oriented training on a supplementary tilt table on lower extremity muscle strength and gait recovery in patients with hemiplegic stroke. (United States)

    Kim, Chang-Yong; Lee, Jung-Sun; Kim, Hyeong-Dong; Kim, June-Sun


    The purpose of this study was to determine the influence of progressive task-oriented training on a supplementary tilt table on the lower extremity (LE) muscle strength and spatiotemporal parameters of gait in subjects with hemiplegic stroke. Thirty subjects between three and nine months post stroke were included in this study. Thirty subjects were randomly allocated to a control group (CG, n1=10), experimental group I (EG1, n2=10), and experimental group II (EG2, n3=10). All of the subjects received routine therapy for half an hour, five times a week for three weeks and additionally received training on the following three different tilt table applications for 20min a day: (1) both knee belts of the tilt table were fastened (CG), (2) only the affected side knee belt of the tilt table was fastened and one-leg standing training was performed using the less-affected LE (EG1), and (3) only the affected side knee belt of the tilt table was fastened and progressive task-oriented training was performed using the less-affected LE (EG2). The effect of tilt table applications was assessed using a hand-held dynamometer for LE muscle strength and GAITRite for spatiotemporal gait data. Our results showed that there was a significantly greater increase in the strength of all LE muscle groups, gait velocity, cadence, and stride length, a decrease in the double limb support period, and an improvement in gait asymmetry in subjects who underwent progressive task-oriented training on a supplementary tilt table compared to those in the other groups. These findings suggest that progressive task-oriented training on a supplementary tilt table can improve the LE muscle strength and spatiotemporal parameters of gait at an early stage of rehabilitation of subjects with hemiplegic stroke. Copyright © 2014 Elsevier B.V. All rights reserved.

  20. Parallel hierarchical radiosity rendering

    Energy Technology Data Exchange (ETDEWEB)

    Carter, Michael [Iowa State Univ., Ames, IA (United States)


    In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.

  1. Scalable Performance Measurement and Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Gamblin, Todd [Univ. of North Carolina, Chapel Hill, NC (United States)


    Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Modern machines may contain 100,000 or more microprocessor cores, and the largest of these, IBM's Blue Gene/L, contains over 200,000 cores. Future systems are expected to support millions of concurrent tasks. In this dissertation, we focus on efficient techniques for measuring and analyzing the performance of applications running on very large parallel machines. Tuning the performance of large-scale applications can be a subtle and time-consuming task because application developers must measure and interpret data from many independent processes. While the volume of the raw data scales linearly with the number of tasks in the running system, the number of tasks is growing exponentially, and data for even small systems quickly becomes unmanageable. Transporting performance data from so many processes over a network can perturb application performance and make measurements inaccurate, and storing such data would require a prohibitive amount of space. Moreover, even if it were stored, analyzing the data would be extremely time-consuming. In this dissertation, we present novel methods for reducing performance data volume. The first draws on multi-scale wavelet techniques from signal processing to compress systemwide, time-varying load-balance data. The second uses statistical sampling to select a small subset of running processes to generate low-volume traces. A third approach combines sampling and wavelet compression to stratify performance data adaptively at run-time and to reduce further the cost of sampled tracing. We have integrated these approaches into Libra, a toolset for scalable load-balance analysis. We present Libra and show how it can be used to analyze data from large scientific applications scalably.

  2. Effect of low-frequency repetitive transcranial magnetic stimulation combining task-oriented training on upper limb motor function recovery after stroke

    Directory of Open Access Journals (Sweden)

    Hong-bin WANG


    Full Text Available Objective To investigate the effect of low-frequency repetitive transcranial magnetic stimulation (rTMS combined with task-oriented training on the recovery of upper limb motor function of stroke patients. Methods A total of 42 patients with hemiplegia after stroke were randomly divided into control group (N = 20 and treatment group (N = 22. Control group received routine rehabilitation training and task-oriented training, and treatment group received low-frequency (1 Hz rTMS over the contralesional cortex addition to routine rehabilitation and task-oriented training. Fugl-Meyer Assessment Scale for Upper Extremity (FMA-UE and Wolf Motor Function Test (WMFT were used to evaluate upper limb motor function of all patients before treatment, after 4-week treatment and 3 months after treatment. The latency and central motor conduction time (CMCT of motor-evoked potential (MEP in the contralesional cortex were recorded and analyzed. Results Compared with control group, FMA-UE score (P = 0.006 and WMFT score (P = 0.024 were significantly increased in treatment group. There was significant difference in FMA-AUE score (P = 0.000 and WMFT score (P = 0.000 at different time points. Compared with before treatment, FMA-UE score (P = 0.000, for all and WMFT score (P = 0.000, for all of patients in both groups were all significantly increased after 4-week treatment and 3 months after treatment. Besides, FMA-UE score (P = 0.000, for all and WMFT score (P = 0.000, for all 3 months after treatment were higher than those after 4-week treatment. There was no statistically significant difference between 2 groups on the latency (P = 0.979 and CMCT (P = 0.807 of MEP before and after treatment, and so was the difference on the latency (P = 0.085 and CMCT (P = 0.507 of MEP in the contralesional cortex at different time points (before treatment, after 4-week treatment and 3 months after treatment. Conclusions Low-frequency rTMS over the contralesional cortex combined

  3. Scalable coherent interface: Links to the future (United States)

    Gustavson, D. B.; Kristiansen, E.


    The Scalable Coherent Interface (SCI) was developed to support closely coupled multiprocessors and their caches in a distributed shared-memory environment, but its scalability and the efficient generality of its architecture make it work very well over a wide range of applications. It can replace a local area network for connecting workstations on a campus. It can be a powerful I/O channel for a supercomputer. It can be the processor, cache-memory I/O connection in a highly parallel computer. It can gather data from enormous particle detectors and distribute it among thousands of processors. It can connect a desktop microprocessor to memory chips a few millimeters away, disk drivers a few meters away, and servers a few kilometers away.

  4. Scalable Resolution Display Walls

    KAUST Repository

    Leigh, Jason


    This article will describe the progress since 2000 on research and development in 2-D and 3-D scalable resolution display walls that are built from tiling individual lower resolution flat panel displays. The article will describe approaches and trends in display hardware construction, middleware architecture, and user-interaction design. The article will also highlight examples of use cases and the benefits the technology has brought to their respective disciplines. © 1963-2012 IEEE.

  5. SPINning parallel systems software

    International Nuclear Information System (INIS)

    Matlin, O.S.; Lusk, E.; McCune, W.


    We describe our experiences in using Spin to verify parts of the Multi Purpose Daemon (MPD) parallel process management system. MPD is a distributed collection of processes connected by Unix network sockets. MPD is dynamic processes and connections among them are created and destroyed as MPD is initialized, runs user processes, recovers from faults, and terminates. This dynamic nature is easily expressible in the Spin/Promela framework but poses performance and scalability challenges. We present here the results of expressing some of the parallel algorithms of MPD and executing both simulation and verification runs with Spin

  6. Scalable Simulation of Electromagnetic Hybrid Codes

    International Nuclear Information System (INIS)

    Perumalla, Kalyan S.; Fujimoto, Richard; Karimabadi, Dr. Homa


    New discrete-event formulations of physics simulation models are emerging that can outperform models based on traditional time-stepped techniques. Detailed simulation of the Earth's magnetosphere, for example, requires execution of sub-models that are at widely differing timescales. In contrast to time-stepped simulation which requires tightly coupled updates to entire system state at regular time intervals, the new discrete event simulation (DES) approaches help evolve the states of sub-models on relatively independent timescales. However, parallel execution of DES-based models raises challenges with respect to their scalability and performance. One of the key challenges is to improve the computation granularity to offset synchronization and communication overheads within and across processors. Our previous work was limited in scalability and runtime performance due to the parallelization challenges. Here we report on optimizations we performed on DES-based plasma simulation models to improve parallel performance. The net result is the capability to simulate hybrid particle-in-cell (PIC) models with over 2 billion ion particles using 512 processors on supercomputing platforms

  7. Scalable optical quantum computer

    International Nuclear Information System (INIS)

    Manykin, E A; Mel'nichenko, E V


    A way of designing a scalable optical quantum computer based on the photon echo effect is proposed. Individual rare earth ions Pr 3+ , regularly located in the lattice of the orthosilicate (Y 2 SiO 5 ) crystal, are suggested to be used as optical qubits. Operations with qubits are performed using coherent and incoherent laser pulses. The operation protocol includes both the method of measurement-based quantum computations and the technique of optical computations. Modern hybrid photon echo protocols, which provide a sufficient quantum efficiency when reading recorded states, are considered as most promising for quantum computations and communications. (quantum computer)

  8. Massively Parallel Finite Element Programming

    KAUST Repository

    Heister, Timo


    Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.

  9. The addition of functional task-oriented mental practice to conventional physical therapy improves motor skills in daily functions after stroke

    Directory of Open Access Journals (Sweden)

    Clarissa C. Santos-Couto-Paz


    Full Text Available BACKGROUND: Mental practice (MP is a cognitive strategy which may improve the acquisition of motor skills and functional performance of athletes and individuals with neurological injuries. OBJECTIVE: To determine whether an individualized, specific functional task-oriented MP, when added to conventional physical therapy (PT, promoted better learning of motor skills in daily functions in individuals with chronic stroke (13±6.5 months post-stroke. METHOD: Nine individuals with stable mild and moderate upper limb impairments participated, by employing an A1-B-A2 single-case design. Phases A1 and A2 included one month of conventional PT, and phase B the addition of MP training to PT. The motor activity log (MAL-Brazil was used to assess the amount of use (AOU and quality of movement (QOM of the paretic upper limb; the revised motor imagery questionnaire (MIQ-RS to assess the abilities in kinesthetic and visual motor imagery; the Minnesota manual dexterity test to assess manual dexterity; and gait speed to assess mobility. RESULTS: After phase A1, no significant changes were observed for any of the outcome measures. However, after phase B, significant improvements were observed for the MAL, AOU and QOM scores (p<0.0001, and MIQ-RS kinesthetic and visual scores (p=0.003; p=0.007, respectively. The significant gains in manual dexterity (p=0.002 and gait speed (p=0.019 were maintained after phase A2. CONCLUSIONS: Specific functional task-oriented MP, when added to conventional PT, led to improvements in motor imagery abilities combined with increases in the AOU and QOM in daily functions, manual dexterity, and gait speed.

  10. Development and Evaluation of Self-Management and Task-Oriented Approach to Rehabilitation Training (START) in the Home: Case Report. (United States)

    Richardson, Julie; DePaul, Vincent; Officer, Alexis; Wilkins, Seanne; Letts, Lori; Bosch, Jackie; Wishart, Laurie


    The incidence of stroke and subsequent level of disability will increase, as age is the greatest risk factor for stroke and the world's population is aging. Hospital admissions are too brief for patients to regain necessary function. Research to examine therapy delivered within the home environment has the potential to expedite relearning of function and reduce health care expenditures. This case report describes the use of the knowledge-to-action cycle (KTA) to develop and evaluate an evidence-based approach for rehabilitation in the home that incorporates self-management and task-oriented functional training (TOFT) for people with stroke. The KTA cycle was used to guide adaptation of evidence from self-management and TOFT into an approach titled START (Self-Management and Task-Oriented Approach to Rehabilitation Training). Three stakeholder symposiums identified barriers and supports to implementation. Clinical practice leaders were engaged as partners in the development of the intervention. An online learning management system housed the resources to support therapist training. Therapist focus groups were conducted and stroke outcomes were used to assess patient response. Eight therapists completed 4 workshops and applied the home intervention in 12 people with stroke. A mentoring process for therapists included feedback from peers and experts after viewing treatment videos. Therapist response was determined from the focus groups; patient response was measured by standardized assessments. The therapists noted that the intervention was easier to implement with patients who were motivated and had minimal cognitive impairment. The KTA cycle provided a structure for the development of this evidence-based rehabilitation intervention, which was feasible to implement in the home. Further evaluation needs to be undertaken to assess the effectiveness of START. © 2015 American Physical Therapy Association.

  11. Scalable photoreactor for hydrogen production

    KAUST Repository

    Takanabe, Kazuhiro


    Provided herein are scalable photoreactors that can include a membrane-free water- splitting electrolyzer and systems that can include a plurality of membrane-free water- splitting electrolyzers. Also provided herein are methods of using the scalable photoreactors provided herein.

  12. Highly scalable Ab initio genomic motif identification

    KAUST Repository

    Marchand, Benoit


    We present results of scaling an ab initio motif family identification system, Dragon Motif Finder (DMF), to 65,536 processor cores of IBM Blue Gene/P. DMF seeks groups of mutually similar polynucleotide patterns within a set of genomic sequences and builds various motif families from them. Such information is of relevance to many problems in life sciences. Prior attempts to scale such ab initio motif-finding algorithms achieved limited success. We solve the scalability issues using a combination of mixed-mode MPI-OpenMP parallel programming, master-slave work assignment, multi-level workload distribution, multi-level MPI collectives, and serial optimizations. While the scalability of our algorithm was excellent (94% parallel efficiency on 65,536 cores relative to 256 cores on a modest-size problem), the final speedup with respect to the original serial code exceeded 250,000 when serial optimizations are included. This enabled us to carry out many large-scale ab initio motiffinding simulations in a few hours while the original serial code would have needed decades of execution time. Copyright 2011 ACM.

  13. Design for scalability in 3D computer graphics architectures

    DEFF Research Database (Denmark)

    Holten-Lund, Hans Erik


    This thesis describes useful methods and techniques for designing scalable hybrid parallel rendering architectures for 3D computer graphics. Various techniques for utilizing parallelism in a pipelines system are analyzed. During the Ph.D study a prototype 3D graphics architecture named Hybris has...... been developed. Hybris is a prototype rendering architeture which can be tailored to many specific 3D graphics applications and implemented in various ways. Parallel software implementations for both single and multi-processor Windows 2000 system have been demonstrated. Working hardware...... as a case study and an application of the Hybris graphics architecture....

  14. Scalable conditional induction variables (CIV) analysis

    KAUST Repository

    Oancea, Cosmin E.


    Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as Alter, or stack operations and pose significant challenges to automatic parallelization. Because the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow-sensitive technique that summarizes both such CIV-based and affine subscripts to program level, using the same representation. Our technique requires no modifications of our dependence tests, which is agnostic to the original shape of the subscripts, and is more powerful than previously reported dependence tests that rely on the pairwise disambiguation of read-write references. We have implemented the CIV analysis in our parallelizing compiler and evaluated its impact on five Fortran benchmarks. We have found that that there are many important loops using CIV subscripts and that our analysis can lead to their scalable parallelization. This in turn has led to the parallelization of the benchmark programs they appear in.

  15. Effectiveness of a combination of cognitive behavioral therapy and task-oriented balance training in reducing the fear of falling in patients with chronic stroke: study protocol for a randomized controlled trial. (United States)

    Liu, Tai-Wa; Ng, Gabriel Y F; Ng, Shamay S M


    The consequences of falls are devastating for patients with stroke. Balance problems and fear of falling are two major challenges, and recent systematic reviews have revealed that habitual physical exercise training alone cannot reduce the occurrence of falls in stroke survivors. However, recent trials with community-dwelling healthy older adults yielded the promising result that interventions with a cognitive behavioral therapy (CBT) component can simultaneously promote balance and reduce the fear of falling. Therefore, the aim of the proposed clinical trial is to evaluate the effectiveness of a combination of CBT and task-oriented balance training (TOBT) in promoting subjective balance confidence, and thereby reducing fear-avoidance behavior, improving balance ability, reducing fall risk, and promoting independent living, community reintegration, and health-related quality of life of patients with stroke. The study will constitute a placebo-controlled single-blind parallel-group randomized controlled trial in which patients are assessed immediately, at 3 months, and at 12 months. The selected participants will be randomly allocated into one of two parallel groups (the experimental group and the control group) with a 1:1 ratio. Both groups will receive 45 min of TOBT twice per week for 8 weeks. In addition, the experimental group will receive a 45-min CBT-based group intervention, and the control group will receive 45 min of general health education (GHE) twice per week for 8 weeks. The primary outcome measure is subjective balance confidence. The secondary outcome measures are fear-avoidance behavior, balance ability, fall risk, level of activities of daily living, community reintegration, and health-related quality of life. The proposed clinical trial will compare the effectiveness of CBT combined with TOBT and GHE combined with TOBT in promoting subjective balance confidence among chronic stroke patients. We hope our results will provide evidence of a safe

  16. Distance-two interpolation for parallel algebraic multigrid

    International Nuclear Information System (INIS)

    Sterck, H de; Falgout, R D; Nolting, J W; Yang, U M


    In this paper we study the use of long distance interpolation methods with the low complexity coarsening algorithm PMIS. AMG performance and scalability is compared for classical as well as long distance interpolation methods on parallel computers. It is shown that the increased interpolation accuracy largely restores the scalability of AMG convergence factors for PMIS-coarsened grids, and in combination with complexity reducing methods, such as interpolation truncation, one obtains a class of parallel AMG methods that enjoy excellent scalability properties on large parallel computers

  17. Scalable Parallelization of Skyline Computation for Multi-core Processors

    DEFF Research Database (Denmark)

    Chester, Sean; Sidlauskas, Darius; Assent, Ira


    of the skyline. However, they do not sufficiently minimize dominance tests and so are not competitive with state-of-the-art sequential algorithms. In this paper, we introduce a novel multi-core skyline algorithm, Hybrid, which processes points in blocks. It maintains a shared, global skyline among all threads...... on challenging workloads a 100-fold speedup over state-of-the-art multi-core algorithms and a 10-fold speedup with 16 cores over state-of-the-art sequential algorithms....

  18. PSHED: a simplified approach to developing parallel programs

    International Nuclear Information System (INIS)

    Mahajan, S.M.; Ramesh, K.; Rajesh, K.; Somani, A.; Goel, M.


    This paper presents a simplified approach in the forms of a tree structured computational model for parallel application programs. An attempt is made to provide a standard user interface to execute programs on BARC Parallel Processing System (BPPS), a scalable distributed memory multiprocessor. The interface package called PSHED provides a basic framework for representing and executing parallel programs on different parallel architectures. The PSHED package incorporates concepts from a broad range of previous research in programming environments and parallel computations. (author). 6 refs

  19. Using MPI to Implement Scalable Libraries (United States)

    Lusk, Ewing

    MPI is an instantiation of a general-purpose programming model, and high-performance implementations of the MPI standard have provided scalability for a wide range of applications. Ease of use was not an explicit goal of the MPI design process, which emphasized completeness, portability, and performance. Thus it is not surprising that MPI is occasionally criticized for being inconvenient to use and thus a drag on software developer productivity. One approach to the productivity issue is to use MPI to implement simpler programming models. Such models may limit the range of parallel algorithms that can be expressed, yet provide sufficient generality to benefit a significant number of applications, even from different domains.We illustrate this concept with the ADLB (Asynchronous, Dynamic Load-Balancing) library, which can be used to express manager/worker algorithms in such a way that their execution is scalable, even on the largestmachines. ADLB makes sophisticated use ofMPI functionality while providing an extremely simple API for the application programmer.We will describe it in the context of solving Sudoku puzzles and a nuclear physics Monte Carlo application currently running on tens of thousands of processors.

  20. Scalable fast multipole accelerated vortex methods

    KAUST Repository

    Hu, Qi


    The fast multipole method (FMM) is often used to accelerate the calculation of particle interactions in particle-based methods to simulate incompressible flows. To evaluate the most time-consuming kernels - the Biot-Savart equation and stretching term of the vorticity equation, we mathematically reformulated it so that only two Laplace scalar potentials are used instead of six. This automatically ensuring divergence-free far-field computation. Based on this formulation, we developed a new FMM-based vortex method on heterogeneous architectures, which distributed the work between multicore CPUs and GPUs to best utilize the hardware resources and achieve excellent scalability. The algorithm uses new data structures which can dynamically manage inter-node communication and load balance efficiently, with only a small parallel construction overhead. This algorithm can scale to large-sized clusters showing both strong and weak scalability. Careful error and timing trade-off analysis are also performed for the cutoff functions induced by the vortex particle method. Our implementation can perform one time step of the velocity+stretching calculation for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.

  1. Scalable Computation of Streamlines on Very Large Datasets

    Energy Technology Data Exchange (ETDEWEB)

    Pugmire, David; Childs, Hank; Garth, Christoph; Ahern, Sean; Weber, Gunther H.


    Understanding vector fields resulting from large scientific simulations is an important and often difficult task. Streamlines, curves that are tangential to a vector field at each point, are a powerful visualization method in this context. Application of streamline-based visualization to very large vector field data represents a significant challenge due to the non-local and data-dependent nature of streamline computation, and requires careful balancing of computational demands placed on I/O, memory, communication, and processors. In this paper we review two parallelization approaches based on established parallelization paradigms (static decomposition and on-demand loading) and present a novel hybrid algorithm for computing streamlines. Our algorithm is aimed at good scalability and performance across the widely varying computational characteristics of streamline-based problems. We perform performance and scalability studies of all three algorithms on a number of prototypical application problems and demonstrate that our hybrid scheme is able to perform well in different settings.

  2. Towards scalable Byzantine fault-tolerant replication (United States)

    Zbierski, Maciej


    Byzantine fault-tolerant (BFT) replication is a powerful technique, enabling distributed systems to remain available and correct even in the presence of arbitrary faults. Unfortunately, existing BFT replication protocols are mostly load-unscalable, i.e. they fail to respond with adequate performance increase whenever new computational resources are introduced into the system. This article proposes a universal architecture facilitating the creation of load-scalable distributed services based on BFT replication. The suggested approach exploits parallel request processing to fully utilize the available resources, and uses a load balancer module to dynamically adapt to the properties of the observed client workload. The article additionally provides a discussion on selected deployment scenarios, and explains how the proposed architecture could be used to increase the dependability of contemporary large-scale distributed systems.

  3. Network selection, Information filtering and Scalable computation (United States)

    Ye, Changqing

    -complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a ``decomposition and combination'' strategy, to break large-scale optimization into many small subproblems to solve in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for mapReduce computation. For example, our methods are scalable to a dataset consisting of three billions of observations on a single machine with sufficient memory, having good timings. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvement in accuracy over state-of-the-art scalable methods.

  4. Scalable cloud without dedicated storage (United States)

    Batkovich, D. V.; Kompaniets, M. V.; Zarochentsev, A. K.


    We present a prototype of a scalable computing cloud. It is intended to be deployed on the basis of a cluster without the separate dedicated storage. The dedicated storage is replaced by the distributed software storage. In addition, all cluster nodes are used both as computing nodes and as storage nodes. This solution increases utilization of the cluster resources as well as improves fault tolerance and performance of the distributed storage. Another advantage of this solution is high scalability with a relatively low initial and maintenance cost. The solution is built on the basis of the open source components like OpenStack, CEPH, etc.

  5. Myria: Scalable Analytics as a Service (United States)

    Howe, B.; Halperin, D.; Whitaker, A.


    At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from irrelevant effort technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.

  6. The STAPL Parallel Graph Library

    KAUST Repository



    This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges. © Springer-Verlag Berlin Heidelberg 2013.

  7. High quality scalable audio codec (United States)

    Kim, Miyoung; Oh, Eunmi; Kim, JungHoe


    The MPEG-4 BSAC (Bit Sliced Arithmetic Coding) is a fine-grain scalable codec with layered structure which consists of a single base-layer and several enhancement layers. The scalable functionality allows us to decode the subsets of a full bitstream and to deliver audio contents adaptively under conditions of heterogeneous network and devices, and user interaction. This bitrate scalability can be provided at the cost of high frequency components. It means that the decoded output of BSAC sounds muffled as the transmitted layers become less and less due to deprived conditions of network and devices. The goal of the proposed technology is to compensate the missing high frequency components, while maintaining the fine grain scalability of BSAC. This paper describes the integration of SBR (Spectral Bandwidth Replication) tool to existing MPEG-4 BSAC. Listening test results show that the sound quality of BSAC is improved when the full bitstream is truncated for lower bitrates, and this quality is comparable to that of BSAC using SBR tool without truncation at the same bitrate.

  8. Scalability study of solid xenon

    Energy Technology Data Exchange (ETDEWEB)

    Yoo, J.; Cease, H.; Jaskierny, W. F.; Markley, D.; Pahlka, R. B.; Balakishiyeva, D.; Saab, T.; Filipenko, M.


    We report a demonstration of the scalability of optically transparent xenon in the solid phase for use as a particle detector above a kilogram scale. We employed a cryostat cooled by liquid nitrogen combined with a xenon purification and chiller system. A modified {\\it Bridgeman's technique} reproduces a large scale optically transparent solid xenon.

  9. Parallel Algorithms for the Exascale Era

    Energy Technology Data Exchange (ETDEWEB)

    Robey, Robert W. [Los Alamos National Laboratory


    New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.

  10. Arm rehabilitation in post stroke subjects: A randomized controlled trial on the efficacy of myoelectrically driven FES applied in a task-oriented approach. (United States)

    Jonsdottir, Johanna; Thorsen, Rune; Aprile, Irene; Galeri, Silvia; Spannocchi, Giovanna; Beghi, Ettore; Bianchi, Elisa; Montesano, Angelo; Ferrarin, Maurizio


    Motor recovery of persons after stroke may be enhanced by a novel approach where residual muscle activity is facilitated by patient-controlled electrical muscle activation. Myoelectric activity from hemiparetic muscles is then used for continuous control of functional electrical stimulation (MeCFES) of same or synergic muscles to promote restoration of movements during task-oriented therapy (TOT). Use of MeCFES during TOT may help to obtain a larger functional and neurological recovery than otherwise possible. Multicenter randomized controlled trial. Eighty two acute and chronic stroke victims were recruited through the collaborating facilities and after signing an informed consent were randomized to receive either the experimental (MeCFES assisted TOT (M-TOT) or conventional rehabilitation care including TOT (C-TOT). Both groups received 45 minutes of rehabilitation over 25 sessions. Outcomes were Action Research Arm Test (ARAT), Upper Extremity Fugl-Meyer Assessment (FMA-UE) scores and Disability of the Arm Shoulder and Hand questionnaire. Sixty eight subjects completed the protocol (Mean age 66.2, range 36.5-88.7, onset months 12.7, range 0.8-19.1) of which 45 were seen at follow up 5 weeks later. There were significant improvements in both groups on ARAT (median improvement: MeCFES TOT group 3.0; C-TOT group 2.0) and FMA-UE (median improvement: M-TOT 4.5; C-TOT 3.5). Considering subacute subjects (time since stroke rehabilitation (57.9%) than in the C-TOT group (33.2%) (difference in proportion improved 24.7%; 95% CI -4.0; 48.6), though the study did not meet the planned sample size. This is the first large multicentre RCT to compare MeCFES assisted TOT with conventional care TOT for the upper extremity. No adverse events or negative outcomes were encountered, thus we conclude that MeCFES can be a safe adjunct to rehabilitation that could promote recovery of upper limb function in persons after stroke, particularly when applied in the subacute phase.

  11. .NET 4.5 parallel extensions

    CERN Document Server

    Freeman, Bryan


    This book contains practical recipes on everything you will need to create task-based parallel programs using C#, .NET 4.5, and Visual Studio. The book is packed with illustrated code examples to create scalable programs.This book is intended to help experienced C# developers write applications that leverage the power of modern multicore processors. It provides the necessary knowledge for an experienced C# developer to work with .NET parallelism APIs. Previous experience of writing multithreaded applications is not necessary.

  12. Optimizing Nanoelectrode Arrays for Scalable Intracellular Electrophysiology. (United States)

    Abbott, Jeffrey; Ye, Tianyang; Ham, Donhee; Park, Hongkun


    , clarifying how the nanoelectrode attains intracellular access. This understanding will be translated into a circuit model for the nanobio interface, which we will then use to lay out the strategies for improving the interface. The intracellular interface of the nanoelectrode is currently inferior to that of the patch clamp electrode; reaching this benchmark will be an exciting challenge that involves optimization of electrode geometries, materials, chemical modifications, electroporation protocols, and recording/stimulation electronics, as we describe in the Account. Another important theme of this Account, beyond the optimization of the individual nanoelectrode-cell interface, is the scalability of the nanoscale electrodes. We will discuss this theme using a recent development from our groups as an example, where an array of ca. 1000 nanoelectrode pixels fabricated on a CMOS integrated circuit chip performs parallel intracellular recording from a few hundreds of cardiomyocytes, which marks a new milestone in electrophysiology.

  13. A look at scalable dense linear algebra libraries

    Energy Technology Data Exchange (ETDEWEB)

    Dongarra, J.J. (Tennessee Univ., Knoxville, TN (United States) Dept. of Computer Science Oak Ridge National Lab., TN (United States)); van de Geijn, R. (Texas Univ., Austin, TX (United States). Dept. of Computer Sciences); Walker, D.W. (Oak Ridge National Lab., TN (United States))


    We discuss the essential design features of a library of scalable software for performing dense linear algebra computations on distributed memory concurrent computers. The square block scattered decomposition is proposed as a flexible and general-purpose way of decomposing most, if not all, dense matrix problems. An object- oriented interface to the library permits more portable applications to be written, and is easy to learn and use, since details of the parallel implementation are hidden from the user. Experiments on the Intel Touchstone Delta system with a prototype code that uses the square block scattered decomposition to perform LU factorization are presented and analyzed. It was found that the code was both scalable and efficient, performing at about 14 Gflop/s (double precision) for the largest problem considered.

  14. CASTOR: Widely Distributed Scalable Infospaces (United States)


    Howie Shrobe, Jonathan Bachrach, Lester Foster. To Appear in Proceedings of the 2006 International Workshop on Wireless Ad-hoc and Sensor platform. ACM Queue, 4(4), May 2006. [16] S. D. Gribble, E. A. Brewer, J. M. Hellerstein, and D. E. Culler . Scalable, distributed data...Apache Software Foundation. Apache Axis, 2006. [29] M. Welsh, D. E. Culler , and E. A. Brewer. SEDA: An Ar- chitecture for

  15. Parallel computations

    CERN Document Server


    Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn

  16. Building a parallel file system simulator

    International Nuclear Information System (INIS)

    Molina-Estolano, E; Maltzahn, C; Brandt, S A; Bent, J


    Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges scientists and file system designers will need a thorough understanding of the design space of parallel file systems. Yet there exist few systematic studies of parallel file system behavior at petabyte- and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems. To contribute to this understanding we are building a parallel file system simulator that can simulate parallel file systems at very large scale. Our goal is to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time and fidelity. With this simulator, file system experts will be able to tune existing file systems for specific workloads, scientists and file system deployment engineers will be able to better communicate workload requirements, file system designers and researchers will be able to try out design alternatives and innovations at scale, and instructors will be able to study very large-scale parallel file system behavior in the class room. In this paper we describe our approach and provide preliminary results that are encouraging both in terms of fidelity and simulation scalability.

  17. Memory-Scalable GPU Spatial Hierarchy Construction. (United States)

    Qiming Hou; Xin Sun; Kun Zhou; Lauterbach, C; Manocha, D


    Recent GPU algorithms for constructing spatial hierarchies have achieved promising performance for moderately complex models by using the breadth-first search (BFS) construction order. While being able to exploit the massive parallelism on the GPU, the BFS order also consumes excessive GPU memory, which becomes a serious issue for interactive applications involving very complex models with more than a few million triangles. In this paper, we propose to use the partial breadth-first search (PBFS) construction order to control memory consumption while maximizing performance. We apply the PBFS order to two hierarchy construction algorithms. The first algorithm is for kd-trees that automatically balances between the level of parallelism and intermediate memory usage. With PBFS, peak memory consumption during construction can be efficiently controlled without costly CPU-GPU data transfer. We also develop memory allocation strategies to effectively limit memory fragmentation. The resulting algorithm scales well with GPU memory and constructs kd-trees of models with millions of triangles at interactive rates on GPUs with 1 GB memory. Compared with existing algorithms, our algorithm is an order of magnitude more scalable for a given GPU memory bound. The second algorithm is for out-of-core bounding volume hierarchy (BVH) construction for very large scenes based on the PBFS construction order. At each iteration, all constructed nodes are dumped to the CPU memory, and the GPU memory is freed for the next iteration's use. In this way, the algorithm is able to build trees that are too large to be stored in the GPU memory. Experiments show that our algorithm can construct BVHs for scenes with up to 20 M triangles, several times larger than previous GPU algorithms.

  18. Percolator: Scalable Pattern Discovery in Dynamic Graphs

    Energy Technology Data Exchange (ETDEWEB)

    Choudhury, Sutanay; Purohit, Sumit; Lin, Peng; Wu, Yinghui; Holder, Lawrence B.; Agarwal, Khushbu


    We demonstrate Percolator, a distributed system for graph pattern discovery in dynamic graphs. In contrast to conventional mining systems, Percolator advocates efficient pattern mining schemes that (1) support pattern detection with keywords; (2) integrate incremental and parallel pattern mining; and (3) support analytical queries such as trend analysis. The core idea of Percolator is to dynamically decide and verify a small fraction of patterns and their in- stances that must be inspected in response to buffered updates in dynamic graphs, with a total mining cost independent of graph size. We demonstrate a) the feasibility of incremental pattern mining by walking through each component of Percolator, b) the efficiency and scalability of Percolator over the sheer size of real-world dynamic graphs, and c) how the user-friendly GUI of Percolator inter- acts with users to support keyword-based queries that detect, browse and inspect trending patterns. We also demonstrate two user cases of Percolator, in social media trend analysis and academic collaboration analysis, respectively.

  19. Scalable inference for stochastic block models

    KAUST Repository

    Peng, Chengbin


    Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of "big data," traditional inference algorithms for such a model are increasingly limited due to their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach to recover the latent parameters of the stochastic block model, in time linear with respect to the number of edges. We also propose a parallel algorithm based on message passing. Our algorithm can overlap communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, to process a real-world graph with about 1.3 million nodes and 10 million edges, our algorithm requires about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm can produce high quality results on both benchmark and real-world graphs. An example of finding more meaningful communities is illustrated consequently in comparison with a popular modularity maximization algorithm.

  20. Runtime volume visualization for parallel CFD (United States)

    Ma, Kwan-Liu


    This paper discusses some aspects of design of a data distributed, massively parallel volume rendering library for runtime visualization of parallel computational fluid dynamics simulations in a message-passing environment. Unlike the traditional scheme in which visualization is a postprocessing step, the rendering is done in place on each node processor. Computational scientists who run large-scale simulations on a massively parallel computer can thus perform interactive monitoring of their simulations. The current library provides an interface to handle volume data on rectilinear grids. The same design principles can be generalized to handle other types of grids. For demonstration, we run a parallel Navier-Stokes solver making use of this rendering library on the Intel Paragon XP/S. The interactive visual response achieved is found to be very useful. Performance studies show that the parallel rendering process is scalable with the size of the simulation as well as with the parallel computer.

  1. READSCAN: a fast and scalable pathogen discovery program with accurate genome relative abundance estimation. (United States)

    Naeem, Raeece; Rashid, Mamoon; Pain, Arnab


    READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in Beowulf compute cluster with 16 nodes (Supplementary Material).

  2. SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores. (United States)

    Meng, Jintao; Wang, Bingqiang; Wei, Yanjie; Feng, Shengzhong; Balaji, Pavan


    There is a widening gap between the throughput of massive parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods requiring long execution time and large amount of memory on a single workstation limit their use on these massive data. This paper presents a highly scalable assembler named as SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically preform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate, especially it generated the longest N50 contig sizes for Fish and Yanhuang datasets. In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at:

  3. Scalable Techniques for Formal Verification

    CERN Document Server

    Ray, Sandip


    This book presents state-of-the-art approaches to formal verification techniques to seamlessly integrate different formal verification methods within a single logical foundation. It should benefit researchers and practitioners looking to get a broad overview of the spectrum of formal verification techniques, as well as approaches to combining such techniques within a single framework. Coverage includes a range of case studies showing how such combination is fruitful in developing a scalable verification methodology for industrial designs. This book outlines both theoretical and practical issue

  4. Parallel algorithms

    CERN Document Server

    Casanova, Henri; Robert, Yves


    ""…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi

  5. Scalable isosurface visualization of massive datasets on commodity off-the-shelf clusters (United States)

    Bajaj, Chandrajit


    Tomographic imaging and computer simulations are increasingly yielding massive datasets. Interactive and exploratory visualizations have rapidly become indispensable tools to study large volumetric imaging and simulation data. Our scalable isosurface visualization framework on commodity off-the-shelf clusters is an end-to-end parallel and progressive platform, from initial data access to the final display. Interactive browsing of extracted isosurfaces is made possible by using parallel isosurface extraction, and rendering in conjunction with a new specialized piece of image compositing hardware called Metabuffer. In this paper, we focus on the back end scalability by introducing a fully parallel and out-of-core isosurface extraction algorithm. It achieves scalability by using both parallel and out-of-core processing and parallel disks. It statically partitions the volume data to parallel disks with a balanced workload spectrum, and builds I/O-optimal external interval trees to minimize the number of I/O operations of loading large data from disk. We also describe an isosurface compression scheme that is efficient for progress extraction, transmission and storage of isosurfaces. PMID:19756231

  6. Event metadata records as a testbed for scalable data mining

    International Nuclear Information System (INIS)

    Gemmeren, P van; Malon, D


    At a data rate of 200 hertz, event metadata records ('TAGs,' in ATLAS parlance) provide fertile grounds for development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise 'data mining,' but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.

  7. Highly Scalable Matching Pursuit Signal Decomposition Algorithm (United States)

    National Aeronautics and Space Administration — In this research, we propose a variant of the classical Matching Pursuit Decomposition (MPD) algorithm with significantly improved scalability and computational...

  8. Subjective comparison of temporal and quality scalability

    DEFF Research Database (Denmark)

    Korhonen, Jari; Reiter, Ulrich; You, Junyong


    be reduced either by downscaling the frame rate (temporal scalability) or the image quality (quality scalability). However, the user preferences between different scalability types are not well known in different scenarios. In this paper, we present a methodology for subjective comparison between temporal...... and quality scalability. The practical experiments with low resolution video sequences show that in general, distortion is a more crucial factor for the perceived subjective quality than frame rate. However, the results also depend on the content. Moreover,, we discuss the role of other different influence...

  9. Space Situational Awareness Data Processing Scalability Utilizing Google Cloud Services (United States)

    Greenly, D.; Duncan, M.; Wysack, J.; Flores, F.

    Space Situational Awareness (SSA) is a fundamental and critical component of current space operations. The term SSA encompasses the awareness, understanding and predictability of all objects in space. As the population of orbital space objects and debris increases, the number of collision avoidance maneuvers grows and prompts the need for accurate and timely process measures. The SSA mission continually evolves to near real-time assessment and analysis demanding the need for higher processing capabilities. By conventional methods, meeting these demands requires the integration of new hardware to keep pace with the growing complexity of maneuver planning algorithms. SpaceNav has implemented a highly scalable architecture that will track satellites and debris by utilizing powerful virtual machines on the Google Cloud Platform. SpaceNav algorithms for processing CDMs outpace conventional means. A robust processing environment for tracking data, collision avoidance maneuvers and various other aspects of SSA can be created and deleted on demand. Migrating SpaceNav tools and algorithms into the Google Cloud Platform will be discussed and the trials and tribulations involved. Information will be shared on how and why certain cloud products were used as well as integration techniques that were implemented. Key items to be presented are: 1.Scientific algorithms and SpaceNav tools integrated into a scalable architecture a) Maneuver Planning b) Parallel Processing c) Monte Carlo Simulations d) Optimization Algorithms e) SW Application Development/Integration into the Google Cloud Platform 2. Compute Engine Processing a) Application Engine Automated Processing b) Performance testing and Performance Scalability c) Cloud MySQL databases and Database Scalability d) Cloud Data Storage e) Redundancy and Availability

  10. CloudETL: Scalable Dimensional ETL for Hadoop and Hive

    DEFF Research Database (Denmark)

    Xiufeng, Liu; Thomsen, Christian; Pedersen, Torben Bach

    Extract-Transform-Load (ETL) programs process data from sources into data warehouses (DWs). Due to the rapid growth of data volumes, there is an increasing demand for systems that can scale on demand. Recently, much attention has been given to MapReduce which is a framework for highly parallel...... handling of massive data sets in cloud environments. The MapReduce-based Hive has been proposed as a DBMS-like system for DWs and provides good and scalable analytical features. It is,however, still challenging to do proper dimensional ETL processing with Hive; for example, UPDATEs are not supported which...... makes handling of slowly changing dimensions (SCDs) very difficult. To remedy this, we here present the cloud-enabled ETL framework CloudETL. CloudETL uses the open source MapReduce implementation Hadoop to parallelize the ETL execution and to process data into Hive. The user defines the ETL process...

  11. A Practical and Scalable Tool to Find Overlaps between Sequences

    Directory of Open Access Journals (Sweden)

    Maan Haj Rachid


    Full Text Available The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-efficient and space-economical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.

  12. Nested Parallelism: Allocation of Threads to Tasks and OpenMP Implementation


    Blikberg, Ragnhild; Sørevik, Tor


    In this paper we discuss the use of nested parallelism. Our claim is that if the problem naturally possesses multiple levels of parallelism, then applying parallelism to all levels may significantly enhance the scalability of your algorithm. This claim is sustained by numerical experiments. We also discuss how to implement multi-level parallelism using OpenMP. We find current OpenMP implementation, based on version 1.0, to have severe limitation for implementing nested parallelization. We the...

  13. Fully Scalable Porous Metal Electrospray Propulsion (United States)


    CubeSat nanosatellites . This was done mostly as an exercise on scalability as originally described in the proposal. Graphic summary of the research...Canada This is one of our papers describing systems-level scalability of electrospray propulsion in the pure ionic regime to nanosatellites . A

  14. The Node Monitoring Component of a Scalable Systems Software Environment

    Energy Technology Data Exchange (ETDEWEB)

    Miller, Samuel James [Iowa State Univ., Ames, IA (United States)


    This research describes Fountain, a suite of programs used to monitor the resources of a cluster. A cluster is a collection of individual computers that are connected via a high speed communication network. They are traditionally used by users who desire more resources, such as processing power and memory, than any single computer can provide. A common drawback to effectively utilizing such a large-scale system is the management infrastructure, which often does not often scale well as the system grows. Large-scale parallel systems provide new research challenges in the area of systems software, the programs or tools that manage the system from boot-up to running a parallel job. The approach presented in this thesis utilizes a collection of separate components that communicate with each other to achieve a common goal. While systems software comprises a broad array of components, this thesis focuses on the design choices for a node monitoring component. We will describe Fountain, an implementation of the Scalable Systems Software (SSS) node monitor specification. It is targeted at aggregate node monitoring for clusters, focusing on both scalability and fault tolerance as its design goals. It leverages widely used technologies such as XML and HTTP to present an interface to other components in the SSS environment.

  15. On eliminating synchronous communication in molecular simulations to improve scalability (United States)

    Straatsma, T. P.; Chavarría-Miranda, Daniel G.


    Molecular dynamics simulation, as a complementary tool to experimentation, has become an important methodology for the understanding and design of molecular systems as it provides access to properties that are difficult, impossible or prohibitively expensive to obtain experimentally. Many of the available software packages have been parallelized to take advantage of modern massively concurrent processing resources. The challenge in achieving parallel efficiency is commonly attributed to the fact that molecular dynamics algorithms are communication intensive. This paper illustrates how an appropriately chosen data distribution and asynchronous one-sided communication approach can be used to effectively deal with the data movement within the Global Arrays/ARMCI programming model framework. A new put_notify capability is presented here, allowing the implementation of the molecular dynamics algorithm without any explicit global or local synchronization or global data reduction operations. In addition, this push-data model is shown to very effectively allow hiding data communication behind computation. Rather than data movement or explicit global reductions, the implicit synchronization of the algorithm becomes the primary challenge for scalability. Without any explicit synchronous operations, the scalability of molecular simulations is shown to depend only on the ability to evenly balance computational load.

  16. Parallel transposition of sparse data structures

    DEFF Research Database (Denmark)

    Wang, Hao; Liu, Weifeng; Hou, Kaixi


    transposition for sparse matrices and graphs, have not received the attention they deserve. In this paper, we first identify that the transposition operation can be a bottleneck of some fundamental sparse matrix and graph algorithms. Then, we revisit the performance and scalability of parallel transposition...... approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that our ScanTrans method achieves an average of 2.8-fold (up to 6.2-fold) speedup over the parallel......Many applications in computational sciences and social sciences exploit sparsity and connectivity of acquired data. Even though many parallel sparse primitives such as sparse matrix-vector (SpMV) multiplication have been extensively studied, some other important building blocks, e.g., parallel...

  17. Scalable domain decomposition solvers for stochastic PDEs in high performance computing

    International Nuclear Information System (INIS)

    Desai, Ajit; Pettit, Chris; Poirel, Dominique; Sarkar, Abhijit


    Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolution in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.

  18. An Introduction to Parallelism, Concurrency and Acceleration (1/2)

    CERN Multimedia

    CERN. Geneva


    Concurrency and parallelism are firm elements of any modern computing infrastructure, made even more prominent by the emergence of accelerators. These lectures offer an introduction to these important concepts. We will begin with a brief refresher of recent hardware offerings to modern-day programmers. We will then open the main discussion with an overview of the laws and practical aspects of scalability. Key parallelism data structures, patterns and algorithms will be shown. The main threats to scalability and mitigation strategies will be discussed in the context of real-life optimization problems.

  19. Scalable Computing for Evolutionary Genomics

    NARCIS (Netherlands)

    Prins, J.C.P.; Belhachemi, D.; Möller, S.; Smant, G.


    Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss techniques for scaling computations through parallelization of calculations, after giving

  20. SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. (United States)

    Schumacher, André; Pireddu, Luca; Niemenmaa, Matti; Kallio, Aleksi; Korpelainen, Eija; Zanetti, Gianluigi; Heljanko, Keijo


    Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPigscripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. Available under the open source MIT license at

  1. Scalable Implementation of Finite Elements by NASA _ Implicit (ScIFEi) (United States)

    Warner, James E.; Bomarito, Geoffrey F.; Heber, Gerd; Hochhalter, Jacob D.


    Scalable Implementation of Finite Elements by NASA (ScIFEN) is a parallel finite element analysis code written in C++. ScIFEN is designed to provide scalable solutions to computational mechanics problems. It supports a variety of finite element types, nonlinear material models, and boundary conditions. This report provides an overview of ScIFEi (\\Sci-Fi"), the implicit solid mechanics driver within ScIFEN. A description of ScIFEi's capabilities is provided, including an overview of the tools and features that accompany the software as well as a description of the input and output le formats. Results from several problems are included, demonstrating the efficiency and scalability of ScIFEi by comparing to finite element analysis using a commercial code.

  2. Impact of multiplexed reading scheme on nanocrossbar memristor memory's scalability

    International Nuclear Information System (INIS)

    Zhu Xuan; Tang Yu-Hua; Wu Jun-Jie; Yi Xun; Wu Chun-Qing


    Nanocrossbar is a potential memory architecture to integrate memristor to achieve large scale and high density memory. However, based on the currently widely-adopted parallel reading scheme, scalability of the nanocrossbar memory is limited, since the overhead of the reading circuits is in proportion with the size of the nanocrossbar component. In this paper, a multiplexed reading scheme is adopted as the foundation of the discussion. Through HSPICE simulation, we reanalyze scalability of the nanocrossbar memristor memory by investigating the impact of various circuit parameters on the output voltage swing as the memory scales to larger size. We find that multiplexed reading maintains sufficient noise margin in large size nanocrossbar memristor memory. In order to improve the scalability of the memory, memristors with nonlinear I—V characteristics and high LRS (low resistive state) resistance should be adopted. (interdisciplinary physics and related areas of science and technology)

  3. Fast and scalable inequality joins

    KAUST Repository

    Khayyat, Zuhair


    Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as (Formula presented.)-tree, (Formula presented.)-tree and Bitmap. However, inequality joins have received little attention and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins which is then used to optimize multiple predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show our solution is more scalable and several orders of magnitude faster. © 2016 Springer-Verlag Berlin Heidelberg

  4. Scalable Gravity Offload System, Phase II (United States)

    National Aeronautics and Space Administration — The proposed innovation is a scalable gravity off-load system that enables controlled integrated testing of Surface System elements such as rovers, habitats, and...

  5. Scalable Gravity Offload System, Phase I (United States)

    National Aeronautics and Space Administration — The proposed innovation is a scalable gravity off-load system that enables controlled integrated testing of Surface System elements such as rovers, habitats, and...

  6. Architecture Knowledge for Evaluating Scalable Databases (United States)


    Tyree . "Using ontology to support development of software architectures," IBM Systems Journal, 45:4 (2006): 813-825. [18] F. de Almeida, F. Ricardo, Abstract—Designing massively scalable, highly available big data systems is an immense challenge for software architects. applications require distributed systems design principles to create scalable solutions, and the selection and adoption of open source and

  7. Complexity scalable motion-compensated temporal filtering (United States)

    Clerckx, Tom; Verdicchio, Fabio; Munteanu, Adrian; Andreopoulos, Yiannis; Devos, Harald; Eeckhaut, Hendrik; Christiaens, Mark; Stroobandt, Dirk; Verkest, Diederik; Schelkens, Peter


    Computer networks and the internet have taken an important role in modern society. Together with their development, the need for digital video transmission over these networks has grown. To cope with the user demands and limitations of the network, compression of the video material has become an important issue. Additionally, many video-applications require flexibility in terms of scalability and complexity (e.g. HD/SD-TV, video-surveillance). Current ITU-T and ISO/IEC video compression standards (MPEG-x, H.26-x) lack efficient support for these types of scalability. Wavelet-based compression techniques have been proposed to tackle this problem, of which the Motion Compensated Temporal Filtering (MCTF)-based architectures couple state-of-the-art performance with full (quality, resolution, and frame-rate) scalability. However, a significant drawback of these architectures is their high complexity. The computational and memory complexity of both spatial domain (SD) MCTF and in-band (IB) MCTF video codec instantiations are examined in this study. Comparisons in terms of complexity versus performance are presented for both types of codecs. The paper indicates how complexity scalability can be achieved in such video-codecs, and analyses some of the trade-offs between complexity and coding performance. Finally, guidelines on how to implement a fully scalable video-codec that incorporates quality, temporal, resolution and complexity scalability are proposed.

  8. Parallel R

    CERN Document Server

    McCallum, Ethan


    It's tough to argue with R as a high-quality, cross-platform, open source statistical software product-unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.

  9. Scalable Multiprocessor for High-Speed Computing in Space (United States)

    Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard


    A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.

  10. Scalable fast multipole methods for vortex element methods

    KAUST Repository

    Hu, Qi


    We use a particle-based method to simulate incompressible flows, where the Fast Multipole Method (FMM) is used to accelerate the calculation of particle interactions. The most time-consuming kernelsâ\\'the Biot-Savart equation and stretching term of the vorticity equationâ\\'are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while automatically ensuring divergence-free far-field computation. Based on this formulation, and on our previous work for a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows including turbulence on heterogeneous architectures, which distributes the work between multi-core CPUs and GPUs to best utilize the hardware resources and achieve excellent scalability. The algorithm also uses new data structures which can dynamically manage inter-node communication and load balance efficiently but with only a small parallel construction overhead. This algorithm can scale to large-sized clusters showing both strong and weak scalability. Careful error and timing trade-off analysis are also performed for the cutoff functions induced by the vortex particle method. Our implementation can perform one time step of the velocity+stretching for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s. © 2012 IEEE.

  11. Multiple Independent File Parallel I/O with HDF5

    Energy Technology Data Exchange (ETDEWEB)

    Miller, M. C.


    The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(106) parallel tasks.

  12. Simplifying Scalable Graph Processing with a Domain-Specific Language

    KAUST Repository

    Hong, Sungpack


    Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and scalability but inconvenient for implementing non-trivial graph algorithms. In this paper, we use Green-Marl, a Domain-Specific Language for graph analysis, to intuitively describe graph algorithms and extend its compiler to generate equivalent Pregel implementations. Using the semantic information captured by Green-Marl, the compiler applies a set of transformation rules that convert imperative graph algorithms into Pregel\\'s programming model. Our experiments show that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms. The compiler is even able to generate a Pregel implementation of a complicated graph algorithm for which a manual Pregel implementation is very challenging.

  13. Scalability and communication performance of HPC on Azure Cloud

    Directory of Open Access Journals (Sweden)

    Hanan A. Hassan


    Full Text Available Different domains of research are moving to cloud computing whether to carry out compute intensive experiments or to store large datasets. Cloud computing services allow the users to focus on the application, without being concerned about the physical resources. HPC system on the cloud is desired for their high needs of efficient CPU computations. Our objective was to evaluate the scalability and performance of High Performance Cloud Computing on Microsoft Azure Cloud infrastructure by using well known Benchmarks, namely, IMB point-to-point communication and NAS Parallel Benchmarks (NPB. In our experiments, performance of the HPC applications on the cloud is assessed in terms of MOPS and speedup, and is tested under different configurations of cluster sizes. Also, point-to point communication performance between nodes is assessed in terms of latency and bandwidth as a function of message size.

  14. Overview of the Scalable Coherent Interface, IEEE STD 1596 (SCI)

    International Nuclear Information System (INIS)

    Gustavson, D.B.; James, D.V.; Wiggers, H.A.


    The Scalable Coherent Interface standard defines a new generation of interconnection that spans the full range from supercomputer memory 'bus' to campus-wide network. SCI provides bus-like services and a shared-memory software model while using an underlying, packet protocol on many independent communication links. Initially these links are 1 GByte/s (wires) and 1 GBit/s (fiber), but the protocol scales well to future faster or lower-cost technologies. The interconnect may use switches, meshes, and rings. The SCI distributed-shared-memory model is simple and versatile, enabling for the first time a smooth integration of highly parallel multiprocessors, workstations, personal computers, I/O, networking and data acquisition

  15. Parallel Lines

    Directory of Open Access Journals (Sweden)

    James G. Worner


    Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship.   ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.

  16. fastBMA: scalable network inference and transitive reduction. (United States)

    Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee


    Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the 1 based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (, as part of the updated networkBMA Bioconductor package ( and as ready-to-deploy Docker images ( © The Authors 2017. Published by Oxford University Press.

  17. Parallel programming practical aspects, models and current limitations

    CERN Document Server

    Tarkov, Mikhail S


    Parallel programming is designed for the use of parallel computer systems for solving time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: 1. Processing large data arrays (including processing images and signals in real time)2. Simulation of complex physical processes and chemical reactions For each of these classes, prospective methods are designed for solving problems. For data processing, one of the most promising technologies is the use of artificial neural networks. Particles-in-cell method and cellular automata are very useful for simulation. Problems of scalability of parallel algorithms and the transfer of existing parallel programs to future parallel computers are very acute now. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure the processing reliability by the relevant organization ...

  18. A Numerical Study of Scalable Cardiac Electro-Mechanical Solvers on HPC Architectures

    Directory of Open Access Journals (Sweden)

    Piero Colli Franzone


    Full Text Available We introduce and study some scalable domain decomposition preconditioners for cardiac electro-mechanical 3D simulations on parallel HPC (High Performance Computing architectures. The electro-mechanical model of the cardiac tissue is composed of four coupled sub-models: (1 the static finite elasticity equations for the transversely isotropic deformation of the cardiac tissue; (2 the active tension model describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; (3 the anisotropic Bidomain model describing the evolution of the intra- and extra-cellular potentials in the deforming cardiac tissue; and (4 the ionic membrane model describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels. This strongly coupled electro-mechanical model is discretized in time with a splitting semi-implicit technique and in space with isoparametric finite elements. The resulting scalable parallel solver is based on Multilevel Additive Schwarz preconditioners for the solution of the Bidomain system and on BDDC preconditioned Newton-Krylov solvers for the non-linear finite elasticity system. The results of several 3D parallel simulations show the scalability of both linear and non-linear solvers and their application to the study of both physiological excitation-contraction cardiac dynamics and re-entrant waves in the presence of different mechano-electrical feedbacks.

  19. Parallel processing of genomics data (United States)

    Agapito, Giuseppe; Guzzi, Pietro Hiram; Cannataro, Mario


    The availability of high-throughput experimental platforms for the analysis of biological samples, such as mass spectrometry, microarrays and Next Generation Sequencing, have made possible to analyze a whole genome in a single experiment. Such platforms produce an enormous volume of data per single experiment, thus the analysis of this enormous flow of data poses several challenges in term of data storage, preprocessing, and analysis. To face those issues, efficient, possibly parallel, bioinformatics software needs to be used to preprocess and analyze data, for instance to highlight genetic variation associated with complex diseases. In this paper we present a parallel algorithm for the parallel preprocessing and statistical analysis of genomics data, able to face high dimension of data and resulting in good response time. The proposed system is able to find statistically significant biological markers able to discriminate classes of patients that respond to drugs in different ways. Experiments performed on real and synthetic genomic datasets show good speed-up and scalability.

  20. Evaluating the scalability of HEP software and multi-core hardware

    International Nuclear Information System (INIS)

    Jarp, Sverre; Lazzaro, Alfio; Leduc, Julien; Nowak, Andrzej


    As researchers have reached the practical limits of processor performance improvements by frequency scaling, it is clear that the future of computing lies in the effective utilization of parallel and multi-core architectures. Since this significant change in computing is well underway, it is vital for HEP programmers to understand the scalability of their software on modern hardware and the opportunities for potential improvements. This work aims to quantify the benefit of new mainstream architectures to the HEP community through practical benchmarking on recent hardware solutions, including the usage of parallelized HEP applications.

  1. The Concept of Business Model Scalability

    DEFF Research Database (Denmark)

    Lund, Morten; Nielsen, Christian


    -term pro table business. However, the main message of this article is that while providing a good value proposition may help the rm ‘get by’, the really successful businesses of today are those able to reach the sweet-spot of business model scalability. Design/Methodology/Approach: The article is based......Purpose: The purpose of the article is to de ne what scalable business models are. Central to the contemporary understanding of business models is the value proposition towards the customer and the hypotheses generated about delivering value to the customer which become a good foundation for a long...... on a ve-year longitudinal action research project of over 90 companies that participated in the International Center for Innovation project aimed at building 10 global network-based business models. Findings: This article introduces and discusses the term scalability from a company-level perspective...

  2. Software performance and scalability a quantitative approach

    CERN Document Server

    Liu, Henry H


    Praise from the Reviewers:"The practicality of the subject in a real-world situation distinguishes this book from othersavailable on the market."—Professor Behrouz Far, University of Calgary"This book could replace the computer organization texts now in use that every CS and CpEstudent must take. . . . It is much needed, well written, and thoughtful."—Professor Larry Bernstein, Stevens Institute of TechnologyA distinctive, educational text onsoftware performance and scalabilityThis is the first book to take a quantitative approach to the subject of software performance and scalability

  3. Enhancing data parallel aplications with task parallelism


    Fernández, Jacqueline; Guerrero, Roberto A.; Piccoli, María Fabiana; Printista, Alicia Marcela; Villalobos, M.


    Most parallel applications contain data parallelism and almost all discussion of its solutions has limited to the simplest and least expressive form: flat data parallelism. Several generalization of the flat data parallel model have been proposed because a large number of those applications need a combination of task and data parallelism to represent their natural computation structure and to achieve good performance in their results. Their aim is to allow the capability of combining the easi...

  4. Content-Aware Scalability-Type Selection for Rate Adaptation of Scalable Video

    Directory of Open Access Journals (Sweden)

    Tekalp A Murat


    Full Text Available Scalable video coders provide different scaling options, such as temporal, spatial, and SNR scalabilities, where rate reduction by discarding enhancement layers of different scalability-type results in different kinds and/or levels of visual distortion depend on the content and bitrate. This dependency between scalability type, video content, and bitrate is not well investigated in the literature. To this effect, we first propose an objective function that quantifies flatness, blockiness, blurriness, and temporal jerkiness artifacts caused by rate reduction by spatial size, frame rate, and quantization parameter scaling. Next, the weights of this objective function are determined for different content (shot types and different bitrates using a training procedure with subjective evaluation. Finally, a method is proposed for choosing the best scaling type for each temporal segment that results in minimum visual distortion according to this objective function given the content type of temporal segments. Two subjective tests have been performed to validate the proposed procedure for content-aware selection of the best scalability type on soccer videos. Soccer videos scaled from 600 kbps to 100 kbps by the proposed content-aware selection of scalability type have been found visually superior to those that are scaled using a single scalability option over the whole sequence.

  5. A parallel computational model for GATE simulations. (United States)

    Rannou, F R; Vega-Acevedo, N; El Bitar, Z


    GATE/Geant4 Monte Carlo simulations are computationally demanding applications, requiring thousands of processor hours to produce realistic results. The classical strategy of distributing the simulation of individual events does not apply efficiently for Positron Emission Tomography (PET) experiments, because it requires a centralized coincidence processing and large communication overheads. We propose a parallel computational model for GATE that handles event generation and coincidence processing in a simple and efficient way by decentralizing event generation and processing but maintaining a centralized event and time coordinator. The model is implemented with the inclusion of a new set of factory classes that can run the same executable in sequential or parallel mode. A Mann-Whitney test shows that the output produced by this parallel model in terms of number of tallies is equivalent (but not equal) to its sequential counterpart. Computational performance evaluation shows that the software is scalable and well balanced. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  6. Asynchronous Parallelization of a CFD Solver

    Directory of Open Access Journals (Sweden)

    Daniel S. Abdi


    Full Text Available A Navier-Stokes equations solver is parallelized to run on a cluster of computers using the domain decomposition method. Two approaches of communication and computation are investigated, namely, synchronous and asynchronous methods. Asynchronous communication between subdomains is not commonly used in CFD codes; however, it has a potential to alleviate scaling bottlenecks incurred due to processors having to wait for each other at designated synchronization points. A common way to avoid this idle time is to overlap asynchronous communication with computation. For this to work, however, there must be something useful and independent a processor can do while waiting for messages to arrive. We investigate an alternative approach of computation, namely, conducting asynchronous iterations to improve local subdomain solution while communication is in progress. An in-house CFD code is parallelized using message passing interface (MPI, and scalability tests are conducted that suggest asynchronous iterations are a viable way of parallelizing CFD code.

  7. Task Oriented Tools for Information Retrieval (United States)

    Yang, Peilin


    Information Retrieval (IR) is one of the most evolving research fields and has drawn extensive attention in recent years. Because of its empirical nature, the advance of the IR field is closely related to the development of various toolkits. While the traditional IR toolkit mainly provides a platform to evaluate the effectiveness of retrieval…

  8. Scalable Boson Sampling with Noisy Components (United States)

    Keating, Tyler; Slote, Joseph; Muraleedharan, Gopikrishnan; Carrasco, Ezequiel; Deutsch, Ivan

    The goal of a Boson Sampler is to efficiently and scalably sample from a probability distribution that cannot be simulated efficiently on a classical computer, thus violating the Extended Church-Turing Thesis (ECTT). To properly falsify the ECTT, the physical device must do so even in the face of realistic noise. Scaling a Boson Sampler requires increasing quantities of a set of fixed-size components (beamsplitters, detectors, etc.), so it is natural to consider noise models that act on each component independently. We show that for any such model, the per-component noise need only decrease polynomially to keep the sampling problem hard. In this sense, Boson Sampling with noise is scalable. However, the same result applies to a number of other quantum information systems, including universal circuit-model quantum computers. Such devices are widely believed to require error correction in order to be truly scalable, even though polynomial reduction of per-component errors would allow them to work without error correction. This belief is consistent with the stricter requirement that error rates should be not just polynomially small, but constant in problem size. We conclude that a more precise definition of scalability with noise is needed to properly evaluate Boson Samplers.

  9. Performance of scalable coding in depth domain (United States)

    Sjöström, Mårten; Karlsson, Linda S.


    Common autostereoscopic 3D displays are based on multi-view projection. The diversity of resolutions and number of views of such displays implies a necessary flexibility of 3D content formats in order to make broadcasting efficient. Furthermore, distribution of content over a heterogeneous network should adapt to an available network capacity. Present scalable video coding provides the ability to adapt to network conditions; it allows for quality, temporal and spatial scaling of 2D video. Scalability for 3D data extends this list to the depth and the view domains. We have introduced scalability with respect to depth information. Our proposed scheme is based on the multi-view-plus-depth format where the center view data are preserved, and side views are extracted in enhancement layers depending on depth values. We investigate the performance of various layer assignment strategies: number of layers, and distribution of layers in depth, either based on equal number of pixels or histogram characteristics. We further consider the consequences to variable distortion due to encoder parameters. The results are evaluated considering their overall distortion verses bit rate, distortion per enhancement layer, as well as visual quality appearance. Scalability with respect to depth (and views) allows for an increased number of quality steps; the cost is a slight increase of required capacity for the whole sequence. The main advantage is, however, an improved quality for objects close to the viewer, even if overall quality is worse.

  10. Using scalable vector graphics to evolve art

    NARCIS (Netherlands)

    den Heijer, E.; Eiben, A. E.


    In this paper, we describe our investigations of the use of scalable vector graphics as a genotype representation in evolutionary art. We describe the technical aspects of using SVG in evolutionary art, and explain our custom, SVG specific operators initialisation, mutation and crossover. We perform

  11. Ubicrawler: a scalable fully distributed web crawler


    Codenotti, Bruno


    We present the design and implementation of UbiCrawler, a scalable distributed web crawler, and we analyze its performance. The main features of UbiCrawler are platform independence, fault tolerance, a very effective assignment function for partitioning the domain to crawl, and more in general the complete decentralization of every task.

  12. Declarative and Scalable Selection for Map Visualizations

    DEFF Research Database (Denmark)

    Kefaloukos, Pimin Konstantin Balic

    supports the PostgreSQL dialect of SQL. The prototype implementation is a compiler that translates CVL into SQL and stored procedures. (c) TileHeat is a framework and basic algorithm for partial materialization of hot tile sets for scalable map distribution. The framework predicts future map workloads...

  13. Realization of a scalable airborne radar

    NARCIS (Netherlands)

    Otten, M.P.G.; Vermeulen, B.C.B.; Liempt, L.J. van; Halsema, D. van; Jongh, R.V. de; Es, J. van


    Modern airborne ground surveillance radar systems are increasingly based on Active Electronically Scanned Array (AESA) antennas. Efficient use of array technology and the need for radar solutions for various airborne platforms, manned and unmanned, leads to the design of scalable radar systems. The

  14. Scalable Detection and Isolation of Phishing

    NARCIS (Netherlands)

    Moreira Moura, Giovane; Pras, Aiko


    This paper presents a proposal for scalable detection and isolation of phishing. The main ideas are to move the protection from end users towards the network provider and to employ the novel bad neighborhood concept, in order to detect and isolate both phishing e-mail senders and phishing web

  15. Scalable Open Source Smart Grid Simulator (SGSim)

    DEFF Research Database (Denmark)

    Ebeid, Emad Samuel Malki; Jacobsen, Rune Hylsberg; Stefanni, Francesco


    The future smart power grid will consist of an unlimited number of smart devices that communicate with control units to maintain the grid’s sustainability, efficiency, and balancing. In order to build and verify such controllers over a large grid, a scalable simulation environment is needed...

  16. Scalable Design of Paired CRISPR Guide RNAs for Genomic Deletion.

    Directory of Open Access Journals (Sweden)

    Carlos Pulido-Quetglas


    Full Text Available CRISPR-Cas9 technology can be used to engineer precise genomic deletions with pairs of single guide RNAs (sgRNAs. This approach has been widely adopted for diverse applications, from disease modelling of individual loci, to parallelized loss-of-function screens of thousands of regulatory elements. However, no solution has been presented for the unique bioinformatic design requirements of CRISPR deletion. We here present CRISPETa, a pipeline for flexible and scalable paired sgRNA design based on an empirical scoring model. Multiple sgRNA pairs are returned for each target, and any number of targets can be analyzed in parallel, making CRISPETa equally useful for focussed or high-throughput studies. Fast run-times are achieved using a pre-computed off-target database. sgRNA pair designs are output in a convenient format for visualisation and oligonucleotide ordering. We present pre-designed, high-coverage library designs for entire classes of protein-coding and non-coding elements in human, mouse, zebrafish, Drosophila melanogaster and Caenorhabditis elegans. In human cells, we reproducibly observe deletion efficiencies of ≥50% for CRISPETa designs targeting an enhancer and exonic fragment of the MALAT1 oncogene. In the latter case, deletion results in production of desired, truncated RNA. CRISPETa will be useful for researchers seeking to harness CRISPR for targeted genomic deletion, in a variety of model organisms, from single-target to high-throughput scales.

  17. SAChES: Scalable Adaptive Chain-Ensemble Sampling.

    Energy Technology Data Exchange (ETDEWEB)

    Swiler, Laura Painton [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Ebeida, Mohamed Salah [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Huang, Maoyi [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Hou, Zhangshuan [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Bao, Jie [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Ren, Huiying [Pacific Northwest National Lab. (PNNL), Richland, WA (United States)


    We present the development of a parallel Markov Chain Monte Carlo (MCMC) method called SAChES, Scalable Adaptive Chain-Ensemble Sampling. This capability is targed to Bayesian calibration of com- putationally expensive simulation models. SAChES involves a hybrid of two methods: Differential Evo- lution Monte Carlo followed by Adaptive Metropolis. Both methods involve parallel chains. Differential evolution allows one to explore high-dimensional parameter spaces using loosely coupled (i.e., largely asynchronous) chains. Loose coupling allows the use of large chain ensembles, with far more chains than the number of parameters to explore. This reduces per-chain sampling burden, enables high-dimensional inversions and the use of computationally expensive forward models. The large number of chains can also ameliorate the impact of silent-errors, which may affect only a few chains. The chain ensemble can also be sampled to provide an initial condition when an aberrant chain is re-spawned. Adaptive Metropolis takes the best points from the differential evolution and efficiently hones in on the poste- rior density. The multitude of chains in SAChES is leveraged to (1) enable efficient exploration of the parameter space; and (2) ensure robustness to silent errors which may be unavoidable in extreme-scale computational platforms of the future. This report outlines SAChES, describes four papers that are the result of the project, and discusses some additional results.

  18. Performance Evaluation of the WSN Routing Protocols Scalability

    Directory of Open Access Journals (Sweden)

    L. Alazzawi


    Full Text Available Scalability is an important factor in designing an efficient routing protocol for wireless sensor networks (WSNs. A good routing protocol has to be scalable and adaptive to the changes in the network topology. Thus scalable protocol should perform well as the network grows larger or as the workload increases. In this paper, routing protocols for wireless sensor networks are simulated and their performances are evaluated to determine their capability for supporting network scalability.

  19. Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy. (United States)

    Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli


    One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as the multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of a layered parallel software libraries that allow image processing applications to share the common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on the Amdahl's law for symmetric multicore chips showed the potential of a high performance scalability of the HPC 3D-MIP platform when a larger number of cores is available.

  20. The P-Mesh: A Commodity-based Scalable Network Architecture for Clusters (United States)

    Nitzberg, Bill; Kuszmaul, Chris; Stockdale, Ian; Becker, Jeff; Jiang, John; Wong, Parkson; Tweten, David (Technical Monitor)


    We designed a new network architecture, the P-Mesh which combines the scalability and fault resilience of a torus with the performance of a switch. We compare the scalability, performance, and cost of the hub, switch, torus, tree, and P-Mesh architectures. The latter three are capable of scaling to thousands of nodes, however, the torus has severe performance limitations with that many processors. The tree and P-Mesh have similar latency, bandwidth, and bisection bandwidth, but the P-Mesh outperforms the switch architecture (a lower bound for tree performance) on 16-node NAB Parallel Benchmark tests by up to 23%, and costs 40% less. Further, the P-Mesh has better fault resilience characteristics. The P-Mesh architecture trades increased management overhead for lower cost, and is a good bridging technology while the price of tree uplinks is expensive.

  1. Declarative and Scalable Selection for Map Visualizations

    DEFF Research Database (Denmark)

    Kefaloukos, Pimin Konstantin Balic

    supports the PostgreSQL dialect of SQL. The prototype implementation is a compiler that translates CVL into SQL and stored procedures. (c) TileHeat is a framework and basic algorithm for partial materialization of hot tile sets for scalable map distribution. The framework predicts future map workloads......, there are indications that the method is scalable for databases that contain millions of records, especially if the target language of the compiler is substituted by a cluster-ready variant of SQL. While several realistic use cases for maps have been implemented in CVL, additional non-geographic data visualization uses...... goal. The results for Tileheat show that the prediction method offers a substantial improvement over the current method used by the Danish Geodata Agency. Thus, a large amount of computations can potentially be saved by this public institution, who is responsible for the distribution of government...

  2. From Digital Disruption to Business Model Scalability

    DEFF Research Database (Denmark)

    Nielsen, Christian; Lund, Morten; Thomsen, Peter Poulsen


    as a response to digital disruption. A series of case studies illustrate that besides frequent existing messages in the business literature relating to the importance of creating agile businesses, both in growing and declining economies, as well as hard to copy value propositions or value propositions that take...... a long time to replicate, business model scalability can be cornered into four dimensions. In many corporate restructuring exercises and Mergers and Acquisitions there is a tendency to look for synergies in the form of cost reductions, lean workflows and market segments. However, this state of mind......This article discusses the terms disruption, digital disruption, business models and business model scalability. It illustrates how managers should be using these terms for the benefit of their business by developing business models capable of achieving exponentially increasing returns to scale...

  3. Scalable fabrication of perovskite solar cells

    Energy Technology Data Exchange (ETDEWEB)

    Li, Zhen; Klein, Talysa R.; Kim, Dong Hoe; Yang, Mengjin; Berry, Joseph J.; van Hest, Maikel F. A. M.; Zhu, Kai


    Perovskite materials use earth-abundant elements, have low formation energies for deposition and are compatible with roll-to-roll and other high-volume manufacturing techniques. These features make perovskite solar cells (PSCs) suitable for terawatt-scale energy production with low production costs and low capital expenditure. Demonstrations of performance comparable to that of other thin-film photovoltaics (PVs) and improvements in laboratory-scale cell stability have recently made scale up of this PV technology an intense area of research focus. Here, we review recent progress and challenges in scaling up PSCs and related efforts to enable the terawatt-scale manufacturing and deployment of this PV technology. We discuss common device and module architectures, scalable deposition methods and progress in the scalable deposition of perovskite and charge-transport layers. We also provide an overview of device and module stability, module-level characterization techniques and techno-economic analyses of perovskite PV modules.

  4. A Scalability Model for ECS's Data Server (United States)

    Menasce, Daniel A.; Singhal, Mukesh


    This report presents in four chapters a model for the scalability analysis of the Data Server subsystem of the Earth Observing System Data and Information System (EOSDIS) Core System (ECS). The model analyzes if the planned architecture of the Data Server will support an increase in the workload with the possible upgrade and/or addition of processors, storage subsystems, and networks. The approaches in the report include a summary of the architecture of ECS's Data server as well as a high level description of the Ingest and Retrieval operations as they relate to ECS's Data Server. This description forms the basis for the development of the scalability model of the data server and the methodology used to solve it.

  5. Quantifying XRootD scalability and overheads

    International Nuclear Information System (INIS)

    De Witt, S; Lahiff, A


    Both ATLAS and CMS experiments are making increasing use of the XRootD architecture to provide access to data not held locally to where a job is running. The anticipation is that this will lead to fewer job failures, although the efficiency of jobs that would otherwise have failed may be reduced. In this paper we look at the overhead and scalability of the XRootD software system, and the overhead of the infrastructure needed to support remote access to data.

  6. Scalable Density-Based Subspace Clustering

    DEFF Research Database (Denmark)

    Müller, Emmanuel; Assent, Ira; Günnemann, Stephan


    method that steers mining to few selected subspace clusters. Our novel steering technique reduces subspace processing by identifying and clustering promising subspaces and their combinations directly. Thereby, it narrows down the search space while maintaining accuracy. Thorough experiments on real...... and synthetic databases show that steering is efficient and scalable, with high quality results. For future work, our steering paradigm for density-based subspace clustering opens research potential for speeding up other subspace clustering approaches as well....

  7. A scalable variational inequality approach for flow through porous media models with pressure-dependent viscosity (United States)

    Mapakshi, N. K.; Chang, J.; Nakshatrala, K. B.


    Mathematical models for flow through porous media typically enjoy the so-called maximum principles, which place bounds on the pressure field. It is highly desirable to preserve these bounds on the pressure field in predictive numerical simulations, that is, one needs to satisfy discrete maximum principles (DMP). Unfortunately, many of the existing formulations for flow through porous media models do not satisfy DMP. This paper presents a robust, scalable numerical formulation based on variational inequalities (VI), to model non-linear flows through heterogeneous, anisotropic porous media without violating DMP. VI is an optimization technique that places bounds on the numerical solutions of partial differential equations. To crystallize the ideas, a modification to Darcy equations by taking into account pressure-dependent viscosity will be discretized using the lowest-order Raviart-Thomas (RT0) and Variational Multi-scale (VMS) finite element formulations. It will be shown that these formulations violate DMP, and, in fact, these violations increase with an increase in anisotropy. It will be shown that the proposed VI-based formulation provides a viable route to enforce DMP. Moreover, it will be shown that the proposed formulation is scalable, and can work with any numerical discretization and weak form. A series of numerical benchmark problems are solved to demonstrate the effects of heterogeneity, anisotropy and non-linearity on DMP violations under the two chosen formulations (RT0 and VMS), and that of non-linearity on solver convergence for the proposed VI-based formulation. Parallel scalability on modern computational platforms will be illustrated through strong-scaling studies, which will prove the efficiency of the proposed formulation in a parallel setting. Algorithmic scalability as the problem size is scaled up will be demonstrated through novel static-scaling studies. The performed static-scaling studies can serve as a guide for users to be able to select

  8. DISP: Optimizations towards Scalable MPI Startup

    Energy Technology Data Exchange (ETDEWEB)

    Fu, Huansong [Florida State University, Tallahassee; Pophale, Swaroop S [ORNL; Gorentla Venkata, Manjunath [ORNL; Yu, Weikuan [Florida State University, Tallahassee


    Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively, our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively, and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.

  9. Scalable robotic biofabrication of tissue spheroids

    Energy Technology Data Exchange (ETDEWEB)

    Mehesz, A Nagy; Hajdu, Z; Visconti, R P; Markwald, R R; Mironov, V [Advanced Tissue Biofabrication Center, Department of Regenerative Medicine and Cell Biology, Medical University of South Carolina, Charleston, SC (United States); Brown, J [Department of Mechanical Engineering, Clemson University, Clemson, SC (United States); Beaver, W [York Technical College, Rock Hill, SC (United States); Da Silva, J V L, E-mail: [Renato Archer Information Technology Center-CTI, Campinas (Brazil)


    Development of methods for scalable biofabrication of uniformly sized tissue spheroids is essential for tissue spheroid-based bioprinting of large size tissue and organ constructs. The most recent scalable technique for tissue spheroid fabrication employs a micromolded recessed template prepared in a non-adhesive hydrogel, wherein the cells loaded into the template self-assemble into tissue spheroids due to gravitational force. In this study, we present an improved version of this technique. A new mold was designed to enable generation of 61 microrecessions in each well of a 96-well plate. The microrecessions were seeded with cells using an EpMotion 5070 automated pipetting machine. After 48 h of incubation, tissue spheroids formed at the bottom of each microrecession. To assess the quality of constructs generated using this technology, 600 tissue spheroids made by this method were compared with 600 spheroids generated by the conventional hanging drop method. These analyses showed that tissue spheroids fabricated by the micromolded method are more uniform in diameter. Thus, use of micromolded recessions in a non-adhesive hydrogel, combined with automated cell seeding, is a reliable method for scalable robotic fabrication of uniform-sized tissue spheroids.

  10. A Scalable Gaussian Process Analysis Algorithm for Biomass Monitoring

    Energy Technology Data Exchange (ETDEWEB)

    Chandola, Varun [ORNL; Vatsavai, Raju [ORNL


    Biomass monitoring is vital for studying the carbon cycle of earth's ecosystem and has several significant implications, especially in the context of understanding climate change and its impacts. Recently, several change detection methods have been proposed to identify land cover changes in temporal profiles (time series) of vegetation collected using remote sensing instruments, but do not satisfy one or both of the two requirements of the biomass monitoring problem, i.e., {\\em operating in online mode} and {\\em handling periodic time series}. In this paper, we adapt Gaussian process regression to detect changes in such time series in an online fashion. While Gaussian process (GP) have been widely used as a kernel based learning method for regression and classification, their applicability to massive spatio-temporal data sets, such as remote sensing data, has been limited owing to the high computational costs involved. We focus on addressing the scalability issues associated with the proposed GP based change detection algorithm. This paper makes several significant contributions. First, we propose a GP based online time series change detection algorithm and demonstrate its effectiveness in detecting different types of changes in {\\em Normalized Difference Vegetation Index} (NDVI) data obtained from a study area in Iowa, USA. Second, we propose an efficient Toeplitz matrix based solution which significantly improves the computational complexity and memory requirements of the proposed GP based method. Specifically, the proposed solution can analyze a time series of length $t$ in $O(t^2)$ time while maintaining a $O(t)$ memory footprint, compared to the $O(t^3)$ time and $O(t^2)$ memory requirement of standard matrix manipulation based methods. Third, we describe a parallel version of the proposed solution which can be used to simultaneously analyze a large number of time series. We study three different parallel implementations: using threads, MPI, and a

  11. Numeric Analysis for Relationship-Aware Scalable Streaming Scheme

    Directory of Open Access Journals (Sweden)

    Heung Ki Lee


    Full Text Available Frequent packet loss of media data is a critical problem that degrades the quality of streaming services over mobile networks. Packet loss invalidates frames containing lost packets and other related frames at the same time. Indirect loss caused by losing packets decreases the quality of streaming. A scalable streaming service can decrease the amount of dropped multimedia resulting from a single packet loss. Content providers typically divide one large media stream into several layers through a scalable streaming service and then provide each scalable layer to the user depending on the mobile network. Also, a scalable streaming service makes it possible to decode partial multimedia data depending on the relationship between frames and layers. Therefore, a scalable streaming service provides a way to decrease the wasted multimedia data when one packet is lost. However, the hierarchical structure between frames and layers of scalable streams determines the service quality of the scalable streaming service. Even if whole packets of layers are transmitted successfully, they cannot be decoded as a result of the absence of reference frames and layers. Therefore, the complicated relationship between frames and layers in a scalable stream increases the volume of abandoned layers. For providing a high-quality scalable streaming service, we choose a proper relationship between scalable layers as well as the amount of transmitted multimedia data depending on the network situation. We prove that a simple scalable scheme outperforms a complicated scheme in an error-prone network. We suggest an adaptive set-top box (AdaptiveSTB to lower the dependency between scalable layers in a scalable stream. Also, we provide a numerical model to obtain the indirect loss of multimedia data and apply it to various multimedia streams. Our AdaptiveSTB enhances the quality of a scalable streaming service by removing indirect loss.

  12. SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution

    International Nuclear Information System (INIS)

    Beckman, P.; Crotinger, J.; Karmesin, S.; Malony, A.; Oldehoeft, R.; Shende, S.; Smith, S.; Vajracharya, S.


    In the solution of large-scale numerical prob- lems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely com- plex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of manag- ing parallelism and data locality from the user. We present innovative algorithms, based on the macro -dataflow model, for detecting data parallelism and efficiently executing data- parallel statements on shared-memory multiprocessors. We also desclibe how these algorithms can be implemented on clusters of SMPS

  13. Parallel science and engineering applications the charm++ approach

    CERN Document Server

    Kale, Laxmikant V


    Developed in the context of science and engineering applications, with each abstraction motivated by and further honed by specific application needs, Charm++ is a production-quality system that runs on almost all parallel computers available. Parallel Science and Engineering Applications: The Charm++ Approach surveys a diverse and scalable collection of science and engineering applications, most of which are used regularly on supercomputers by scientists to further their research. After a brief introduction to Charm++, the book presents several parallel CSE codes written in the Charm++ model, along with their underlying scientific and numerical formulations, explaining their parallelization strategies and parallel performance. These chapters demonstrate the versatility of Charm++ and its utility for a wide variety of applications, including molecular dynamics, cosmology, quantum chemistry, fracture simulations, agent-based simulations, and weather modeling. The book is intended for a wide audience of people i...

  14. Parallel ray tracing for one-dimensional discrete ordinate computations

    International Nuclear Information System (INIS)

    Jarvis, R.D.; Nelson, P.


    The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)

  15. Parallel science and engineering applications the Charm++ approach

    CERN Document Server

    Kale, Laxmikant V


    Developed in the context of science and engineering applications, with each abstraction motivated by and further honed by specific application needs, Charm++ is a production-quality system that runs on almost all parallel computers available. Parallel Science and Engineering Applications: The Charm++ Approach surveys a diverse and scalable collection of science and engineering applications, most of which are used regularly on supercomputers by scientists to further their research. After a brief introduction to Charm++, the book presents several parallel CSE codes written in the Charm++ model, along with their underlying scientific and numerical formulations, explaining their parallelization strategies and parallel performance. These chapters demonstrate the versatility of Charm++ and its utility for a wide variety of applications, including molecular dynamics, cosmology, quantum chemistry, fracture simulations, agent-based simulations, and weather modeling. The book is intended for a wide audience of people i...

  16. Petascale Parallelization of the Gyrokinetic Toroidal Code

    Energy Technology Data Exchange (ETDEWEB)

    Ethier, Stephane; Adams, Mark; Carter, Jonathan; Oliker, Leonid


    The Gyrokinetic Toroidal Code (GTC) is a global, three-dimensional particle-in-cell application developed to study microturbulence in tokamak fusion devices. The global capability of GTC is unique, allowing researchers to systematically analyze important dynamics such as turbulence spreading. In this work we examine a new radial domain decomposition approach to allow scalability onto the latest generation of petascale systems. Extensive performance evaluation is conducted on three high performance computing systems: the IBM BG/P, the Cray XT4, and an Intel Xeon Cluster. Overall results show that the radial decomposition approach dramatically increases scalability, while reducing the memory footprint - allowing for fusion device simulations at an unprecedented scale. After a decade where high-end computing (HEC) was dominated by the rapid pace of improvements to processor frequencies, the performance of next-generation supercomputers is increasingly differentiated by varying interconnect designs and levels of integration. Understanding the tradeoffs of these system designs is a key step towards making effective petascale computing a reality. In this work, we examine a new parallelization scheme for the Gyrokinetic Toroidal Code (GTC) [?] micro-turbulence fusion application. Extensive scalability results and analysis are presented on three HEC systems: the IBM BlueGene/P (BG/P) at Argonne National Laboratory, the Cray XT4 at Lawrence Berkeley National Laboratory, and an Intel Xeon cluster at Lawrence Livermore National Laboratory. Overall results indicate that the new radial decomposition approach successfully attains unprecedented scalability to 131,072 BG/P cores by overcoming the memory limitations of the previous approach. The new version is well suited to utilize emerging petascale resources to access new regimes of physical phenomena.

  17. Scalability Dilemma and Statistic Multiplexed Computing — A Theory and Experiment

    Directory of Open Access Journals (Sweden)

    Justin Yuan Shi


    Full Text Available The For the last three decades, end-to-end computing paradigms, such as MPI (Message Passing Interface, RPC (Remote Procedure Call and RMI (Remote Method Invocation, have been the de facto paradigms for distributed and parallel programming. Despite of the successes, applications built using these paradigms suffer due to the proportionality factor of crash in the application with its size. Checkpoint/restore and backup/recovery are the only means to save otherwise lost critical information. The scalability dilemma is such a practical challenge that the probability of the data losses increases as the application scales in size. The theoretical significance of this practical challenge is that it undermines the fundamental structure of the scientific discovery process and mission critical services in production today. In 1997, the direct use of end-to-end reference model in distributed programming was recognized as a fallacy. The scalability dilemma was predicted. However, this voice was overrun by the passage of time. Today, the rapidly growing digitized data demands solving the increasingly critical scalability challenges. Computing architecture scalability, although loosely defined, is now the front and center of large-scale computing efforts. Constrained only by the economic law of diminishing returns, this paper proposes a narrow definition of a Scalable Computing Service (SCS. Three scalability tests are also proposed in order to distinguish service architecture flaws from poor application programming. Scalable data intensive service requires additional treatments. Thus, the data storage is assumed reliable in this paper. A single-sided Statistic Multiplexed Computing (SMC paradigm is proposed. A UVR (Unidirectional Virtual Ring SMC architecture is examined under SCS tests. SMC was designed to circumvent the well-known impossibility of end-to-end paradigms. It relies on the proven statistic multiplexing principle to deliver reliable service

  18. MPI-hybrid Parallelism for Volume Rendering on Large, Multi-core Systems

    Energy Technology Data Exchange (ETDEWEB)

    Howison, Mark; Bethel, E. Wes; Childs, Hank


    This work studies the performance and scalability characteristics of"hybrid'" parallel programming and execution as applied to raycasting volume rendering -- a staple visualization algorithm -- on a large, multi-core platform. Historically, the Message Passing Interface (MPI) has become the de-facto standard for parallel programming and execution on modern parallel systems. As the computing industry trends towards multi-core processors, with four- and six-core chips common today and 128-core chips coming soon, we wish to better understand how algorithmic and parallel programming choices impact performance and scalability on large, distributed-memory multi-core systems. Our findings indicate that the hybrid-parallel implementation, at levels of concurrency ranging from 1,728 to 216,000, performs better, uses a smaller absolute memory footprint, and consumes less communication bandwidth than the traditional, MPI-only implementation.

  19. Parallel Programming with Intel Parallel Studio XE

    CERN Document Server

    Blair-Chappell , Stephen


    Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the

  20. A task parallel implementation of fast multipole methods

    KAUST Repository

    Taura, Kenjiro


    This paper describes a task parallel implementation of ExaFMM, an open source implementation of fast multipole methods (FMM), using a lightweight task parallel library MassiveThreads. Although there have been many attempts on parallelizing FMM, experiences have almost exclusively been limited to formulation based on flat homogeneous parallel loops. FMM in fact contains operations that cannot be readily expressed in such conventional but restrictive models. We show that task parallelism, or parallel recursions in particular, allows us to parallelize all operations of FMM naturally and scalably. Moreover it allows us to parallelize a \\'\\'mutual interaction\\'\\' for force/potential evaluation, which is roughly twice as efficient as a more conventional, unidirectional force/potential evaluation. The net result is an open source FMM that is clearly among the fastest single node implementations, including those on GPUs; with a million particles on a 32 cores Sandy Bridge 2.20GHz node, it completes a single time step including tree construction and force/potential evaluation in 65 milliseconds. The study clearly showcases both programmability and performance benefits of flexible parallel constructs over more monolithic parallel loops. © 2012 IEEE.

  1. Parallel Auxiliary Space AMG Solver for $H(div)$ Problems

    Energy Technology Data Exchange (ETDEWEB)

    Kolev, Tzanio V. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Vassilevski, Panayot S. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)


    We present a family of scalable preconditioners for matrices arising in the discretization of $H(div)$ problems using the lowest order Raviart--Thomas finite elements. Our approach belongs to the class of “auxiliary space''--based methods and requires only the finite element stiffness matrix plus some minimal additional discretization information about the topology and orientation of mesh entities. Also, we provide a detailed algebraic description of the theory, parallel implementation, and different variants of this parallel auxiliary space divergence solver (ADS) and discuss its relations to the Hiptmair--Xu (HX) auxiliary space decomposition of $H(div)$ [SIAM J. Numer. Anal., 45 (2007), pp. 2483--2509] and to the auxiliary space Maxwell solver AMS [J. Comput. Math., 27 (2009), pp. 604--623]. Finally, an extensive set of numerical experiments demonstrates the robustness and scalability of our implementation on large-scale $H(div)$ problems with large jumps in the material coefficients.

  2. An Introduction to Parallelism, Concurrency and Acceleration (1/2)

    CERN Multimedia

    CERN. Geneva


    Concurrency and parallelism are firm elements of any modern computing infrastructure, made even more prominent by the emergence of accelerators. These lectures offer an introduction to these important concepts. We will begin with a brief refresher of recent hardware offerings to modern-day programmers. We will then open the main discussion with an overview of the laws and practical aspects of scalability. Key parallelism data structures, patterns and algorithms will be shown. The main threats to scalability and mitigation strategies will be discussed in the context of real-life optimization problems. Lecturer's short bio: Andrzej Nowak has 10 years of experience in computing technologies, primarily from CERN openlab and Intel. At CERN, he managed a research lab collaborating with Intel and was part of the openlab Chief Technology Office. Andrzej also worked closely and initiated projects with the private sector (e.g. HP and Google), as well as international research institutes, such as EPFL. Current...

  3. Parallelization of pressure equation solver for incompressible N-S equations

    International Nuclear Information System (INIS)

    Ichihara, Kiyoshi; Yokokawa, Mitsuo; Kaburaki, Hideo.


    A pressure equation solver in a code for 3-dimensional incompressible flow analysis has been parallelized by using red-black SOR method and PCG method on Fujitsu VPP500, a vector parallel computer with distributed memory. For the comparison of scalability, the solver using the red-black SOR method has been also parallelized on the Intel Paragon, a scalar parallel computer with a distributed memory. The scalability of the red-black SOR method on both VPP500 and Paragon was lost, when number of processor elements was increased. The reason of non-scalability on both systems is increasing communication time between processor elements. In addition, the parallelization by DO-loop division makes the vectorizing efficiency lower on VPP500. For an effective implementation on VPP500, a large scale problem which holds very long vectorized DO-loops in the parallel program should be solved. PCG method with red-black SOR method applied to incomplete LU factorization (red-black PCG) has more iteration steps than normal PCG method with forward and backward substitution, in spite of same number of the floating point operations in a DO-loop of incomplete LU factorization. The parallelized red-black PCG method has less merits than the parallelized red-black SOR method when the computational region has fewer grids, because the low vectorization efficiency is obtained in red-black PCG method. (author)

  4. Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations (United States)

    Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash


    Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.

  5. Parallel symbolic execution for automated real-world software testing


    Bucur, Stefan; Ureche, Vlad; Zamfir, Cristian; Candea, George


    This paper introduces Cloud9, a platform for automated testing of real-world software. Our main contribution is the scalable parallelization of symbolic execution on clusters of commodity hardware, to help cope with path explosion. Cloud9 provides a systematic interface for writing "symbolic tests" that concisely specify entire families of inputs and behaviors to be tested, thus improving testing productivity. Cloud9 can handle not only single-threaded programs but also multi-threaded and dis...

  6. Building a Parallel Cloud Storage System using OpenStack’s Swift Object Store and Transformative Parallel I/O

    Energy Technology Data Exchange (ETDEWEB)

    Burns, Andrew J. [Los Alamos National Laboratory; Lora, Kaleb D. [Los Alamos National Laboratory; Martinez, Esteban [Los Alamos National Laboratory; Shorter, Martel L. [Los Alamos National Laboratory


    Our project consists of bleeding-edge research into replacing the traditional storage archives with a parallel, cloud-based storage solution. It used OpenStack's Swift Object Store cloud software. It's Benchmarked Swift for write speed and scalability. Our project is unique because Swift is typically used for reads and we are mostly concerned with write speeds. Cloud Storage is a viable archive solution because: (1) Container management for larger parallel archives might ease the migration workload; (2) Many tools that are written for cloud storage could be utilized for local archive; and (3) Current large cloud storage practices in industry could be utilized to manage a scalable archive solution.

  7. The Scalable Coherent Interface and related standards projects

    International Nuclear Information System (INIS)

    Gustavson, D.B.


    The Scalable Coherent Interface (SCI) project (IEEE P1596) found a way to avoid the limits that are inherent in bus technology. SCI provides bus-like services by transmitting packets on a collection of point-to-point unidirectional links. The SCI protocols support cache coherence in a distributed-shared-memory multiprocessor model, message passing, I/O, and local-area-network-like communication over fiber optic or wire links. VLSI circuits that operate parallel links at 1000 MByte/s and serial links at 1000 Mbit/s will be available early in 1992. Several ongoing SCI-related projects are applying the SCI technology to new areas or extending it to more difficult problems. P1596.1 defines the architecture of a bridge between SCI and VME; P1596.2 compatibly extends the cache coherence mechanism for efficient operation with kiloprocessor systems; P1596.3 defines new low-voltage (about 0.25 V) differential signals suitable for low power interfaces for CMOS or GaAs VLSI implementations of SCI; P1596.4 defines a high performance memory chip interface using these signals; P1596.5 defines data transfer formats for efficient interprocessor communication in heterogeneous multiprocessor systems. This paper reports the current status of SCI, related standards, and new projects. 16 refs

  8. MR-Tree - A Scalable MapReduce Algorithm for Building Decision Trees

    Directory of Open Access Journals (Sweden)

    Vasile PURDILĂ


    Full Text Available Learning decision trees against very large amounts of data is not practical on single node computers due to the huge amount of calculations required by this process. Apache Hadoop is a large scale distributed computing platform that runs on commodity hardware clusters and can be used successfully for data mining task against very large datasets. This work presents a parallel decision tree learning algorithm expressed in MapReduce programming model that runs on Apache Hadoop platform and has a very good scalability with dataset size.

  9. Scalable Tool Infrastructure for the Cray XT Using Tree-Based Overlay Networks

    Energy Technology Data Exchange (ETDEWEB)

    Roth, Philip C [ORNL; Vetter, Jeffrey S [ORNL


    Performance, debugging, and administration tools are critical for the effective use of parallel computing platforms, but traditional tools have failed to overcome several problems that limit their scalability, such as communication between a large number of tool processes and the management and processing of the volume of data generated on a large number of compute nodes. A tree-based overlay network has proven effective for overcoming these challenges. In this paper, we present our experiences in bringing our MRNet tree-based overlay network infrastructure to the Cray XT platform, including a description of proof-of-concept tools that use MRNet on the Cray XT.

  10. READSCAN: A fast and scalable pathogen discovery program with accurate genome relative abundance estimation

    KAUST Repository

    Naeem, Raeece


    Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: Contact: or Supplementary information: Supplementary data are available at Bioinformatics online. 2012 The Author(s).

  11. An ODMG-compatible testbed architecture for scalable management and analysis of physics data

    International Nuclear Information System (INIS)

    Malon, D.M.; May, E.N.


    This paper describes a testbed architecture for the investigation and development of scalable approaches to the management and analysis of massive amounts of high energy physics data. The architecture has two components: an interface layer that is compliant with a substantial subset of the ODMG-93 Version 1.2 specification, and a lightweight object persistence manager that provides flexible storage and retrieval services on a variety of single- and multi-level storage architectures, and on a range of parallel and distributed computing platforms

  12. Tip-Based Nanofabrication for Scalable Manufacturing

    Directory of Open Access Journals (Sweden)

    Huan Hu


    Full Text Available Tip-based nanofabrication (TBN is a family of emerging nanofabrication techniques that use a nanometer scale tip to fabricate nanostructures. In this review, we first introduce the history of the TBN and the technology development. We then briefly review various TBN techniques that use different physical or chemical mechanisms to fabricate features and discuss some of the state-of-the-art techniques. Subsequently, we focus on those TBN methods that have demonstrated potential to scale up the manufacturing throughput. Finally, we discuss several research directions that are essential for making TBN a scalable nano-manufacturing technology.

  13. Programming Scala Scalability = Functional Programming + Objects

    CERN Document Server

    Wampler, Dean


    Learn how to be more productive with Scala, a new multi-paradigm language for the Java Virtual Machine (JVM) that integrates features of both object-oriented and functional programming. With this book, you'll discover why Scala is ideal for highly scalable, component-based applications that support concurrency and distribution. Programming Scala clearly explains the advantages of Scala as a JVM language. You'll learn how to leverage the wealth of Java class libraries to meet the practical needs of enterprise and Internet projects more easily. Packed with code examples, this book provides us

  14. Scalable and Anonymous Group Communication with MTor

    Directory of Open Access Journals (Sweden)

    Lin Dong


    Full Text Available This paper presents MTor, a low-latency anonymous group communication system. We construct MTor as an extension to Tor, allowing the construction of multi-source multicast trees on top of the existing Tor infrastructure. MTor does not depend on an external service to broker the group communication, and avoids central points of failure and trust. MTor’s substantial bandwidth savings and graceful scalability enable new classes of anonymous applications that are currently too bandwidth-intensive to be viable through traditional unicast Tor communication-e.g., group file transfer, collaborative editing, streaming video, and real-time audio conferencing.

  15. Cooperative Scalable Moving Continuous Query Processing

    DEFF Research Database (Denmark)

    Li, Xiaohui; Karras, Panagiotis; Jensen, Christian S.


    A range of applications call for a mobile client to continuously monitor others in close proximity. Past research on such problems has covered two extremes: It has offered totally centralized solutions, where a server takes care of all queries, and totally distributed solutions, in which...... there is no central authority at all. Unfortunately, none of these two solutions scales to intensive moving object tracking applications, where each client poses a query. In this paper, we formulate the moving continuous query (MCQ) problem and propose a balanced model where servers cooperatively take care...... and computation cost for both servers and clients. An experimental study demonstrates that our approaches offer better scalability than competitors...

  16. Scalable group level probabilistic sparse factor analysis

    DEFF Research Database (Denmark)

    Hinrich, Jesper Løve; Nielsen, Søren Føns Vind; Riis, Nicolai Andre Brogaard


    Many data-driven approaches exist to extract neural representations of functional magnetic resonance imaging (fMRI) data, but most of them lack a proper probabilistic formulation. We propose a scalable group level probabilistic sparse factor analysis (psFA) allowing spatially sparse maps, component...... pruning using automatic relevance determination (ARD) and subject specific heteroscedastic spatial noise modeling. For task-based and resting state fMRI, we show that the sparsity constraint gives rise to components similar to those obtained by group independent component analysis. The noise modeling...

  17. Towards a Scalable, Biomimetic, Antibacterial Coating (United States)

    Dickson, Mary Nora

    Corneal afflictions are the second leading cause of blindness worldwide. When a corneal transplant is unavailable or contraindicated, an artificial cornea device is the only chance to save sight. Bacterial or fungal biofilm build up on artificial cornea devices can lead to serious complications including the need for systemic antibiotic treatment and even explantation. As a result, much emphasis has been placed on anti-adhesion chemical coatings and antibiotic leeching coatings. These methods are not long-lasting, and microorganisms can eventually circumvent these measures. Thus, I have developed a surface topographical antimicrobial coating. Various surface structures including rough surfaces, superhydrophobic surfaces, and the natural surfaces of insects' wings and sharks' skin are promising anti-biofilm candidates, however none meet the criteria necessary for implementation on the surface of an artificial cornea device. In this thesis I: 1) developed scalable fabrication protocols for a library of biomimetic nanostructure polymer surfaces 2) assessed the potential these for poly(methyl methacrylate) nanopillars to kill or prevent formation of biofilm by E. coli bacteria and species of Pseudomonas and Staphylococcus bacteria and improved upon a proposed mechanism for the rupture of Gram-negative bacterial cell walls 3) developed a scalable, commercially viable method for producing antibacterial nanopillars on a curved, PMMA artificial cornea device and 4) developed scalable fabrication protocols for implantation of antibacterial nanopatterned surfaces on the surfaces of thermoplastic polyurethane materials, commonly used in catheter tubings. This project constitutes a first step towards fabrication of the first entirely PMMA artificial cornea device. The major finding of this work is that by precisely controlling the topography of a polymer surface at the nano-scale, we can kill adherent bacteria and prevent biofilm formation of certain pathogenic bacteria

  18. Scalable Optical-Fiber Communication Networks (United States)

    Chow, Edward T.; Peterson, John C.


    Scalable arbitrary fiber extension network (SAFEnet) is conceptual fiber-optic communication network passing digital signals among variety of computers and input/output devices at rates from 200 Mb/s to more than 100 Gb/s. Intended for use with very-high-speed computers and other data-processing and communication systems in which message-passing delays must be kept short. Inherent flexibility makes it possible to match performance of network to computers by optimizing configuration of interconnections. In addition, interconnections made redundant to provide tolerance to faults.

  19. Parallel Computing Characteristics of Two-Phase Thermal-Hydraulics code, CUPID

    International Nuclear Information System (INIS)

    Lee, Jae Ryong; Yoon, Han Young


    Parallelized CUPID code has proved to be able to reproduce multi-dimensional thermal hydraulic analysis by validating with various conceptual problems and experimental data. In this paper, the characteristics of the parallelized CUPID code were investigated. Both single- and two phase simulation are taken into account. Since the scalability of a parallel simulation is known to be better for fine mesh system, two types of mesh system are considered. In addition, the dependency of the preconditioner for matrix solver was also compared. The scalability for the single-phase flow is better than that for two-phase flow due to the less numbers of iterations for solving pressure matrix. The CUPID code was investigated the parallel performance in terms of scalability. The CUPID code was parallelized with domain decomposition method. The MPI library was adopted to communicate the information at the interface cells. As increasing the number of mesh, the scalability is improved. For a given mesh, single-phase flow simulation with diagonal preconditioner shows the best speedup. However, for the two-phase flow simulation, the ILU preconditioner is recommended since it reduces the overall simulation time

  20. Modular Universal Scalable Ion-trap Quantum Computer (United States)


    trap quantum computer . This architecture has two separate layers of scalability: the first is to increase the number of ion qubits in a single trap...Distribution Unlimited UU UU UU UU 02-06-2016 1-Aug-2010 31-Jan-2016 Final Report: Modular Universal Scalable Ion-trap Quantum Computer The views...P.O. Box 12211 Research Triangle Park, NC 27709-2211 Ion trap quantum computation , scalable modular architectures REPORT DOCUMENTATION PAGE 11

  1. Parallel implementations of 2D explicit Euler solvers

    International Nuclear Information System (INIS)

    Giraud, L.; Manzini, G.


    In this work we present a subdomain partitioning strategy applied to an explicit high-resolution Euler solver. We describe the design of a portable parallel multi-domain code suitable for parallel environments. We present several implementations on a representative range of MlMD computers that include shared memory multiprocessors, distributed virtual shared memory computers, as well as networks of workstations. Computational results are given to illustrate the efficiency, the scalability, and the limitations of the different approaches. We discuss also the effect of the communication protocol on the optimal domain partitioning strategy for the distributed memory computers

  2. Practical parallel computing

    CERN Document Server

    Morse, H Stephen


    Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi

  3. Parallel sorting algorithms

    CERN Document Server

    Akl, Selim G


    Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the

  4. A scalable neuroinformatics data flow for electrophysiological signals using MapReduce (United States)

    Jayapandian, Catherine; Wei, Annan; Ramesh, Priya; Zonjy, Bilal; Lhatoo, Samden D.; Loparo, Kenneth; Zhang, Guo-Qiang; Sahoo, Satya S.


    Data-driven neuroscience research is providing new insights in progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow that uses new data partitioning techniques to store and analyze electrophysiological signal in distributed computing infrastructure. The Cloudwave data flow uses MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with large volume of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volume of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications. PMID:25852536

  5. Parallel computing in enterprise modeling.

    Energy Technology Data Exchange (ETDEWEB)

    Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.


    This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

  6. Performance studies of the parallel VIM code

    International Nuclear Information System (INIS)

    Shi, B.; Blomquist, R.N.


    In this paper, the authors evaluate the performance of the parallel version of the VIM Monte Carlo code on the IBM SPx at the High Performance Computing Research Facility at ANL. Three test problems with contrasting computational characteristics were used to assess effects in performance. A statistical method for estimating the inefficiencies due to load imbalance and communication is also introduced. VIM is a large scale continuous energy Monte Carlo radiation transport program and was parallelized using history partitioning, the master/worker approach, and p4 message passing library. Dynamic load balancing is accomplished when the master processor assigns chunks of histories to workers that have completed a previously assigned task, accommodating variations in the lengths of histories, processor speeds, and worker loads. At the end of each batch (generation), the fission sites and tallies are sent from each worker to the master process, contributing to the parallel inefficiency. All communications are between master and workers, and are serial. The SPx is a scalable 128-node parallel supercomputer with high-performance Omega switches of 63 microsec latency and 35 MBytes/sec bandwidth. For uniform and reproducible performance, they used only the 120 identical regular processors (IBM RS/6000) and excluded the remaining eight planet nodes, which may be loaded by other's jobs

  7. Scalability Optimization of Seamless Positioning Service

    Directory of Open Access Journals (Sweden)

    Juraj Machaj


    Full Text Available Recently positioning services are getting more attention not only within research community but also from service providers. From the service providers point of view positioning service that will be able to work seamlessly in all environments, for example, indoor, dense urban, and rural, has a huge potential to open new markets. However, such system does not only need to provide accurate position estimates but have to be scalable and resistant to fake positioning requests. In the previous works we have proposed a modular system, which is able to provide seamless positioning in various environments. The system automatically selects optimal positioning module based on available radio signals. The system currently consists of three positioning modules—GPS, GSM based positioning, and Wi-Fi based positioning. In this paper we will propose algorithm which will reduce time needed for position estimation and thus allow higher scalability of the modular system and thus allow providing positioning services to higher amount of users. Such improvement is extremely important, for real world application where large number of users will require position estimates, since positioning error is affected by response time of the positioning server.

  8. Towards Scalable Graph Computation on Mobile Devices (United States)

    Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng


    Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach. PMID:25859564

  9. Towards Scalable Graph Computation on Mobile Devices. (United States)

    Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng


    Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach.

  10. Big data integration: scalability and sustainability

    KAUST Repository

    Zhang, Zhang


    Integration of various types of omics data is critically indispensable for addressing most important and complex biological questions. In the era of big data, however, data integration becomes increasingly tedious, time-consuming and expensive, posing a significant obstacle to fully exploit the wealth of big biological data. Here we propose a scalable and sustainable architecture that integrates big omics data through community-contributed modules. Community modules are contributed and maintained by different committed groups and each module corresponds to a specific data type, deals with data collection, processing and visualization, and delivers data on-demand via web services. Based on this community-based architecture, we build Information Commons for Rice (IC4R;, a rice knowledgebase that integrates a variety of rice omics data from multiple community modules, including genome-wide expression profiles derived entirely from RNA-Seq data, resequencing-based genomic variations obtained from re-sequencing data of thousands of rice varieties, plant homologous genes covering multiple diverse plant species, post-translational modifications, rice-related literatures, and community annotations. Taken together, such architecture achieves integration of different types of data from multiple community-contributed modules and accordingly features scalable, sustainable and collaborative integration of big data as well as low costs for database update and maintenance, thus helpful for building IC4R into a comprehensive knowledgebase covering all aspects of rice data and beneficial for both basic and translational researches.

  11. Efficient Parallel Execution of Event-Driven Electromagnetic Hybrid Models

    Energy Technology Data Exchange (ETDEWEB)

    Perumalla, Kalyan S [ORNL; Karimabadi, Dr. Homa [SciberQuest Inc.; Fujimoto, Richard [ORNL


    New discrete-event formulations of physics simulation models are emerging that can outperform traditional time-stepped models, especially in simulations containing multiple timescales. Detailed simulation of the Earth's magnetosphere, for example, requires execution of sub-models that operate at timescales that differ by orders of magnitude. In contrast to time-stepped simulation which requires tightly coupled updates to almost the entire system state at regular time intervals, the new discrete event simulation (DES) approaches help evolve the states of sub-models on relatively independent timescales. However, in contrast to relative ease of parallelization of time-stepped codes, the parallelization of DES-based models raises challenges with respect to their scalability and performance. One of the key challenges is to improve the computation granularity to offset synchronization and communication overheads within and across processors. Our previous work on parallelization was limited in scalability and runtime performance due to such challenges. Here we report on optimizations we performed on DES-based plasma simulation models to improve parallel execution performance. The mapping of the model to simulation processes is optimized via aggregation techniques, and the parallel runtime engine is optimized for communication and memory efficiency. The net result is the capability to simulate hybrid particle-in-cell (PIC) models with over 2 billion ion particles using 512 processors on supercomputing platforms.

  12. A Massively Parallel Sparse Eigensolver for Structural Dynamics Finite Element Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Day, David M.; Reese, G.M.


    Eigenanalysis is a critical component of structural dynamics which is essential for determinating the vibrational response of systems. This effort addresses the development of numerical algorithms associated with scalable eigensolver techniques suitable for use on massively parallel, distributed memory computers that are capable of solving large scale structural dynamics problems. An iterative Lanczos method was determined to be the best choice for the application. Scalability of the eigenproblem depends on scalability of the underlying linear solver. A multi-level solver (FETI) was selected as most promising for this component. Issues relating to heterogeneous materials, mechanisms and multipoint constraints have been examined, and the linear solver algorithm has been developed to incorporate features that result in a scalable, robust algorithm for practical structural dynamics applications. The resulting tools have been demonstrated on large problems representative of a weapon's system.

  13. Scalable Strategies for Computing with Massive Data

    Directory of Open Access Journals (Sweden)

    Michael Kane


    Full Text Available This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.

  14. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.

    Directory of Open Access Journals (Sweden)

    Ahmed Shamsul Arefin

    Full Text Available BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers. An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU, can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool is available under GNU Public License (GPL at

  15. GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs. (United States)

    Arefin, Ahmed Shamsul; Riveros, Carlos; Berretta, Regina; Moscato, Pablo


    The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers). An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU), can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN) search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour) for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN) provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool is available under GNU Public License (GPL) at

  16. Introduction to parallel programming

    CERN Document Server

    Brawer, Steven


    Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race

  17. Parallel computing works!

    CERN Document Server

    Fox, Geoffrey C; Messina, Guiseppe C


    A clear illustration of how parallel computers can be successfully appliedto large-scale scientific computations. This book demonstrates how avariety of applications in physics, biology, mathematics and other scienceswere implemented on real parallel computers to produce new scientificresults. It investigates issues of fine-grained parallelism relevant forfuture supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configuredifferent massively parallel machines, design and implement basic systemsoftware, and develop

  18. Parallel simulation today (United States)

    Nicol, David; Fujimoto, Richard


    This paper surveys topics that presently define the state of the art in parallel simulation. Included in the tutorial are discussions on new protocols, mathematical performance analysis, time parallelism, hardware support for parallel simulation, load balancing algorithms, and dynamic memory management for optimistic synchronization.

  19. Applications of the Aurora parallel Prolog system to computational molecular biology

    Energy Technology Data Exchange (ETDEWEB)

    Lusk, E.L.; Overbeek, R. [Argonne National Lab., IL (United States); Mudambi, S. [Knox Coll., Galesburg, IL (United States); Szeredi, P. [IQSOFT, Budapest (Hungary)


    We describe an investigation into the use of the Aurora parallel Prolog system in two applications within the area of computational molecular biology. The computational requirements were large, due to the nature of the applications, and were large, due to the nature of the applications, and were carried out on a scalable parallel computer the BBN ``Butterfly`` TC-2000. Results include both a demonstration that logic programming can be effective in the context of demanding applications on large-scale parallel machines, and some insights into parallel programming in Prolog.

  20. Benefits of Parallel I/O in Ab Initio Nuclear Physics Calculations

    International Nuclear Information System (INIS)

    Laghave, Nikhil; Sosonkina, Masha; Maris, Pieter; Vary, James P.


    Many modern scientific applications rely on highly parallel calculations, which scale to 10's of thousands processors. However, most applications do not concentrate on parallelizing input/output operations. In particular, sequential I/O has been identified as a bottleneck for the highly scalable MFDn (Many Fermion Dynamics for nuclear structure) code performing ab initio nuclear structure calculations. In this paper, we develop interfaces and parallel I/O procedures to use a well-known parallel I/O library in MFDn. As a result, we gain efficient input/output of large datasets along with their portability and ease of use in the downstream processing.

  1. Scalable collision detection using p-partition fronts on many-core processors. (United States)

    Zhang, Xinyu; Kim, Young J


    We present a new parallel algorithm for collision detection using many-core computing platforms of CPUs or GPUs. Based on the notion of a $(p)$-partition front, our algorithm is able to evenly partition and distribute the workload of BVH traversal among multiple processing cores without the need for dynamic balancing, while minimizing the memory overhead inherent to the state-of-the-art parallel collision detection algorithms. We demonstrate the scalability of our algorithm on different benchmarking scenarios with and without using temporal coherence, including dynamic simulation of rigid bodies, cloth simulation, and random collision courses. In these experiments, we observe nearly linear performance improvement in terms of the number of processing cores on the CPUs and GPUs.

  2. Highly Scalable Asynchronous Computing Method for Partial Differential Equations: A Path Towards Exascale (United States)

    Konduri, Aditya

    Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.

  3. Scalable Transactions for Web Applications in the Cloud

    NARCIS (Netherlands)

    Zhou, W.; Pierre, G.E.O.; Chi, C.-H.


    Cloud Computing platforms provide scalability and high availability properties for web applications but they sacrifice data consistency at the same time. However, many applications cannot afford any data inconsistency. We present a scalable transaction manager for NoSQL cloud database services to

  4. Building scalable apps with Redis and Node.js

    CERN Document Server

    Johanan, Joshua


    If the phrase scalability sounds alien to you, then this is an ideal book for you. You will not need much Node.js experience as each framework is demonstrated in a way that requires no previous knowledge of the framework. You will be building scalable Node.js applications in no time! Knowledge of JavaScript is required.

  5. New Complexity Scalable MPEG Encoding Techniques for Mobile Applications

    Directory of Open Access Journals (Sweden)

    Stephan Mietens


    Full Text Available Complexity scalability offers the advantage of one-time design of video applications for a large product family, including mobile devices, without the need of redesigning the applications on the algorithmic level to meet the requirements of the different products. In this paper, we present complexity scalable MPEG encoding having core modules with modifications for scalability. The interdependencies of the scalable modules and the system performance are evaluated. Experimental results show scalability giving a smooth change in complexity and corresponding video quality. Scalability is basically achieved by varying the number of computed DCT coefficients and the number of evaluated motion vectors but other modules are designed such they scale with the previous parameters. In the experiments using the “Stefan” sequence, the elapsed execution time of the scalable encoder, reflecting the computational complexity, can be gradually reduced to roughly 50% of its original execution time. The video quality scales between 20 dB and 48 dB PSNR with unity quantizer setting, and between 21.5 dB and 38.5 dB PSNR for different sequences targeting 1500 kbps. The implemented encoder and the scalability techniques can be successfully applied in mobile systems based on MPEG video compression.

  6. GRACOS: Scalable and Load Balanced P3M Cosmological N-body Code (United States)

    Shirokov, Alexander; Bertschinger, Edmund


    The GRACOS (GRAvitational COSmology) code, a parallel implementation of the particle-particle/particle-mesh (P3M) algorithm for distributed memory clusters, uses a hybrid method for both computation and domain decomposition. Long-range forces are computed using a Fourier transform gravity solver on a regular mesh; the mesh is distributed across parallel processes using a static one-dimensional slab domain decomposition. Short-range forces are computed by direct summation of close pairs; particles are distributed using a dynamic domain decomposition based on a space-filling Hilbert curve. A nearly-optimal method was devised to dynamically repartition the particle distribution so as to maintain load balance even for extremely inhomogeneous mass distributions. Tests using 800(3) simulations on a 40-processor beowulf cluster showed good load balance and scalability up to 80 processes. There are limits on scalability imposed by communication and extreme clustering which may be removed by extending the algorithm to include adaptive mesh refinement.

  7. The Open Connectome Project Data Cluster: Scalable Analysis and Vision for High-Throughput Neuroscience. (United States)

    Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob


    We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes - neural connectivity maps of the brain-using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems-reads to parallel disk arrays and writes to solid-state storage-to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization.

  8. Scalable on-chip quantum state tomography (United States)

    Titchener, James G.; Gräfe, Markus; Heilmann, René; Solntsev, Alexander S.; Szameit, Alexander; Sukhorukov, Andrey A.


    Quantum information systems are on a path to vastly exceed the complexity of any classical device. The number of entangled qubits in quantum devices is rapidly increasing, and the information required to fully describe these systems scales exponentially with qubit number. This scaling is the key benefit of quantum systems, however it also presents a severe challenge. To characterize such systems typically requires an exponentially long sequence of different measurements, becoming highly resource demanding for large numbers of qubits. Here we propose and demonstrate a novel and scalable method for characterizing quantum systems based on expanding a multi-photon state to larger dimensionality. We establish that the complexity of this new measurement technique only scales linearly with the number of qubits, while providing a tomographically complete set of data without a need for reconfigurability. We experimentally demonstrate an integrated photonic chip capable of measuring two- and three-photon quantum states with statistical reconstruction fidelity of 99.71%.

  9. Applications for the scalable coherent interface (United States)

    Gustavson, David B.


    IEEE P1596, the Scalable Coherent Interface (formerly known as SuperBus) is based on experience gained while developing Fastbus (ANSI/IEEE 960-1986, IEC 935), Futurebus (IEEE P896.x) and other modern high-performance buses. SCI goals include a minimum and bandwidth of 1 GByte/sec per processor in multiprocessor systems with thousands of processors; efficient support of a coherent distributed-cache image of distributed shared memory; support for bridges which interface to existing or future buses; and support for inexpensive small rings as well as for general switched interconnections like Banyan, Omega, or crossbar networks. This paper reports to the status of the work in progress and suggests some applications in data acquisition and physics.

  10. Scalable quantum search using trapped ions

    International Nuclear Information System (INIS)

    Ivanov, S. S.; Ivanov, P. A.; Linington, I. E.; Vitanov, N. V.


    We propose a scalable implementation of Grover's quantum search algorithm in a trapped-ion quantum information processor. The system is initialized in an entangled Dicke state by using adiabatic techniques. The inversion-about-average and oracle operators take the form of single off-resonant laser pulses. This is made possible by utilizing the physical symmetries of the trapped-ion linear crystal. The physical realization of the algorithm represents a dramatic simplification: each logical iteration (oracle and inversion about average) requires only two physical interaction steps, in contrast to the large number of concatenated gates required by previous approaches. This not only facilitates the implementation but also increases the overall fidelity of the algorithm.

  11. BASSET: Scalable Gateway Finder in Large Graphs

    Energy Technology Data Exchange (ETDEWEB)

    Tong, H; Papadimitriou, S; Faloutsos, C; Yu, P S; Eliassi-Rad, T


    Given a social network, who is the best person to introduce you to, say, Chris Ferguson, the poker champion? Or, given a network of people and skills, who is the best person to help you learn about, say, wavelets? The goal is to find a small group of 'gateways': persons who are close enough to us, as well as close enough to the target (person, or skill) or, in other words, are crucial in connecting us to the target. The main contributions are the following: (a) we show how to formulate this problem precisely; (b) we show that it is sub-modular and thus it can be solved near-optimally; (c) we give fast, scalable algorithms to find such gateways. Experiments on real data sets validate the effectiveness and efficiency of the proposed methods, achieving up to 6,000,000x speedup.

  12. Scalable graphene aptasensors for drug quantification (United States)

    Vishnubhotla, Ramya; Ping, Jinglei; Gao, Zhaoli; Lee, Abigail; Saouaf, Olivia; Vrudhula, Amey; Johnson, A. T. Charlie


    Simpler and more rapid approaches for therapeutic drug-level monitoring are highly desirable to enable use at the point-of-care. We have developed an all-electronic approach for detection of the HIV drug tenofovir based on scalable fabrication of arrays of graphene field-effect transistors (GFETs) functionalized with a commercially available DNA aptamer. The shift in the Dirac voltage of the GFETs varied systematically with the concentration of tenofovir in deionized water, with a detection limit less than 1 ng/mL. Tests against a set of negative controls confirmed the specificity of the sensor response. This approach offers the potential for further development into a rapid and convenient point-of-care tool with clinically relevant performance.

  13. The Concept of Business Model Scalability

    DEFF Research Database (Denmark)

    Nielsen, Christian; Lund, Morten


    for a long-term profitable business. However, the message conveyed in this article is that while providing a good value proposition may help the firm ‘get by’, the really successful businesses of today are those able to reach the sweet-spot of business model scalability. This article introduces and discusses......The power of business models lies in their ability to visualize and clarify how firms’ may configure their value creation processes. Among the key aspects of business model thinking are a focus on what the customer values, how this value is best delivered to the customer and how strategic partners...... are leveraged in this value creation, delivery and realization exercise. Central to the mainstream understanding of business models is the value proposition towards the customer and the hypothesis generated is that if the firm delivers to the customer what he/she requires, then there is a good foundation...

  14. Parallel Atomistic Simulations

    Energy Technology Data Exchange (ETDEWEB)



    Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.

  15. A parallel buffer tree

    DEFF Research Database (Denmark)

    Sitchinava, Nodar; Zeh, Norbert


    We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number...... of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries...... in the optimal OhOf(psortN + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and psortN is the parallel I/O complexity of sorting N elements using P processors....

  16. Improving diabetes medication adherence: successful, scalable interventions

    Directory of Open Access Journals (Sweden)

    Zullig LL


    Full Text Available Leah L Zullig,1,2 Walid F Gellad,3,4 Jivan Moaddeb,2,5 Matthew J Crowley,1,2 William Shrank,6 Bradi B Granger,7 Christopher B Granger,8 Troy Trygstad,9 Larry Z Liu,10 Hayden B Bosworth1,2,7,11 1Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center, Durham, NC, USA; 2Department of Medicine, Duke University, Durham, NC, USA; 3Center for Health Equity Research and Promotion, Pittsburgh Veterans Affairs Medical Center, Pittsburgh, PA, USA; 4Division of General Internal Medicine, University of Pittsburgh, Pittsburgh, PA, USA; 5Institute for Genome Sciences and Policy, Duke University, Durham, NC, USA; 6CVS Caremark Corporation; 7School of Nursing, Duke University, Durham, NC, USA; 8Department of Medicine, Division of Cardiology, Duke University School of Medicine, Durham, NC, USA; 9North Carolina Community Care Networks, Raleigh, NC, USA; 10Pfizer, Inc., and Weill Medical College of Cornell University, New York, NY, USA; 11Department of Psychiatry and Behavioral Sciences, Duke University School of Medicine, Durham, NC, USA Abstract: Effective medications are a cornerstone of prevention and disease treatment, yet only about half of patients take their medications as prescribed, resulting in a common and costly public health challenge for the US healthcare system. Since poor medication adherence is a complex problem with many contributing causes, there is no one universal solution. This paper describes interventions that were not only effective in improving medication adherence among patients with diabetes, but were also potentially scalable (ie, easy to implement to a large population. We identify key characteristics that make these interventions effective and scalable. This information is intended to inform healthcare systems seeking proven, low resource, cost-effective solutions to improve medication adherence. Keywords: medication adherence, diabetes mellitus, chronic disease, dissemination research

  17. Scalable and Media Aware Adaptive Video Streaming over Wireless Networks

    Directory of Open Access Journals (Sweden)

    Béatrice Pesquet-Popescu


    Full Text Available This paper proposes an advanced video streaming system based on scalable video coding in order to optimize resource utilization in wireless networks with retransmission mechanisms at radio protocol level. The key component of this system is a packet scheduling algorithm which operates on the different substreams of a main scalable video stream and which is implemented in a so-called media aware network element. The concerned type of transport channel is a dedicated channel subject to parameters (bitrate, loss rate variations on the long run. Moreover, we propose a combined scalability approach in which common temporal and SNR scalability features can be used jointly with a partitioning of the image into regions of interest. Simulation results show that our approach provides substantial quality gain compared to classical packet transmission methods and they demonstrate how ROI coding combined with SNR scalability allows to improve again the visual quality.

  18. Parallel single cell analysis on an integrated microfluidic platform for cell trapping, lysis and analysis

    NARCIS (Netherlands)

    le Gac, Severine; de Boer, Hans L.; Wijnperle, Daniël; Meuleman, W.; Carlen, Edwin; van den Berg, Albert; Kim, Tae Song; Lee, Yoon-Sik; Chung, Taek-Dong; Jeon, Noo Li; Lee, Sang-Hoon; Suh, Kahp-Yang; Choo, Jaebum; Kim, Yong-Kweon


    We report here a novel and easily scalable microfluidic platform for the parallel analysis of hundreds of individual cells, with controlled single cell trapping, followed by their lysis and subsequent retrieval of the cellular content for on-chip analysis. The device consists of a main channel and

  19. A family of domain decomposition methods for the massively parallel solution of computational mechanics problems (United States)

    Pierson, Kendall Hugh

    The Finite Element Tearing and Interconnecting (FETI) algorithms are numerically scalable iterative domain decomposition methods for solving systems of equations generated from the finite element discretization of second- or fourth-order elasticity problems. These methods have been substantially improved over the last ten years and recently shown parallel scalability up to one thousand processors. The purpose of this thesis is to present and investigate a dual-primal FETI method, which addresses some of the critical issues related to the original FETI methods. These critical issues involve the accurate computation of the local rigid body modes, the cost and size of the FETI coarse problems with respect to fourth-order elasticity problems, and the overall robustness and versatility of the equation solver. These improvements due to the dual-primal FETI formulation are especially beneficial when implemented on massively parallel distributed memory computers such as the Accelerated Strategic Computing Initiative (ASCI) Red Option supercomputer. Numerical results will be shown detailing scalability with respect to the mesh size, subdomain size, and the number of elements per subdomain for both second- and fourth-order elasticity problems. Parallel scalability will be reported for various large scale realistic problems on a SGI Origin 2000 and the ASCI Red option massively parallel supercomputer. Lastly, results from linear dynamics, eigenvalue analysis and geometrically non-linear static problems will be shown highlighting the benefits of FETI methods for solving large-scale problems with multiple right hand sides.

  20. Parallel linear solvers for simulations of reactor thermal hydraulics

    International Nuclear Information System (INIS)

    Yan, Y.; Antal, S.P.; Edge, B.; Keyes, D.E.; Shaver, D.; Bolotnov, I.A.; Podowski, M.Z.


    The state-of-the-art multiphase fluid dynamics code, NPHASE-CMFD, performs multiphase flow simulations in complex domains using implicit nonlinear treatment of the governing equations and in parallel, which is a very challenging environment for the linear solver. The present work illustrates how the Portable, Extensible Toolkit for Scientific Computation (PETSc) and scalable Algebraic Multigrid (AMG) preconditioner from Hypre can be utilized to construct robust and scalable linear solvers for the Newton correction equation obtained from the discretized system of governing conservation equations in NPHASE-CMFD. The overall long-tem objective of this work is to extend the NPHASE-CMFD code into a fully-scalable solver of multiphase flow and heat transfer problems, applicable to both steady-state and stiff time-dependent phenomena in complete fuel assemblies of nuclear reactors and, eventually, the entire reactor core (such as the Virtual Reactor concept envisioned by CASL). This campaign appropriately begins with the linear algebraic equation solver, which is traditionally a bottleneck to scalability in PDE-based codes. The computational complexity of the solver is usually superlinear in problem size, whereas the rest of the code, the “physics” portion, usually has its complexity linear in the problem size. (author)

  1. A Hardware-Efficient Scalable Spike Sorting Neural Signal Processor Module for Implantable High-Channel-Count Brain Machine Interfaces. (United States)

    Yang, Yuning; Boling, Sam; Mason, Andrew J


    Next-generation brain machine interfaces demand a high-channel-count neural recording system to wirelessly monitor activities of thousands of neurons. A hardware efficient neural signal processor (NSP) is greatly desirable to ease the data bandwidth bottleneck for a fully implantable wireless neural recording system. This paper demonstrates a complete multichannel spike sorting NSP module that incorporates all of the necessary spike detector, feature extractor, and spike classifier blocks. To meet high-channel-count and implantability demands, each block was designed to be highly hardware efficient and scalable while sharing resources efficiently among multiple channels. To process multiple channels in parallel, scalability analysis was performed, and the utilization of each block was optimized according to its input data statistics and the power, area and/or speed of each block. Based on this analysis, a prototype 32-channel spike sorting NSP scalable module was designed and tested on an FPGA using synthesized datasets over a wide range of signal to noise ratios. The design was mapped to 130 nm CMOS to achieve 0.75 μW power and 0.023 mm 2 area consumptions per channel based on post synthesis simulation results, which permits scalability of digital processing to 690 channels on a 4×4 mm 2 electrode array.

  2. Scalable Molecular Dynamics for Large Biomolecular Systems

    Directory of Open Access Journals (Sweden)

    Robert K. Brunner


    Full Text Available We present an optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD. With an object-based hybrid force and spatial decomposition scheme, and an aggressive measurement-based predictive load balancing framework, we have attained speeds and speedups that are much higher than any reported in literature so far. The paper first summarizes the broad methodology we are pursuing, and the basic parallelization scheme we used. It then describes the optimizations that were instrumental in increasing performance, and presents performance results on benchmark simulations.

  3. Large-scale parallel genome assembler over cloud computing environment. (United States)

    Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong


    The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.

  4. A parallel processing VLSI BAM engine. (United States)

    Hasan, S R; Siong, N K


    In this paper emerging parallel/distributed architectures are explored for the digital VLSI implementation of adaptive bidirectional associative memory (BAM) neural network. A single instruction stream many data stream (SIMD)-based parallel processing architecture, is developed for the adaptive BAM neural network, taking advantage of the inherent parallelism in BAM. This novel neural processor architecture is named the sliding feeder BAM array processor (SLiFBAM). The SLiFBAM processor can be viewed as a two-stroke neural processing engine, It has four operating modes: learn pattern, evaluate pattern, read weight, and write weight. Design of a SLiFBAM VLSI processor chip is also described. By using 2-mum scalable CMOS technology, a SLiFBAM processor chip with 4+4 neurons and eight modules of 256x5 bit local weight-storage SRAM, was integrated on a 6.9x7.4 mm(2) prototype die. The system architecture is highly flexible and modular, enabling the construction of larger BAM networks of up to 252 neurons using multiple SLiFBAM chips.

  5. Parallelism in matrix computations

    CERN Document Server

    Gallopoulos, Efstratios; Sameh, Ahmed H


    This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...

  6. Parallelization in Modern C++

    CERN Multimedia

    CERN. Geneva


    The traditionally used and well established parallel programming models OpenMP and MPI are both targeting lower level parallelism and are meant to be as language agnostic as possible. For a long time, those models were the only widely available portable options for developing parallel C++ applications beyond using plain threads. This has strongly limited the optimization capabilities of compilers, has inhibited extensibility and genericity, and has restricted the use of those models together with other, modern higher level abstractions introduced by the C++11 and C++14 standards. The recent revival of interest in the industry and wider community for the C++ language has also spurred a remarkable amount of standardization proposals and technical specifications being developed. Those efforts however have so far failed to build a vision on how to seamlessly integrate various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous many-task execution flows, continuation s...

  7. Scalable Adaptive Multilevel Solvers for Multiphysics Problems

    Energy Technology Data Exchange (ETDEWEB)

    Xu, Jinchao [Pennsylvania State Univ., University Park, PA (United States). Dept. of Mathematics


    In this project, we carried out many studies on adaptive and parallel multilevel methods for numerical modeling for various applications, including Magnetohydrodynamics (MHD) and complex fluids. We have made significant efforts and advances in adaptive multilevel methods of the multiphysics problems: multigrid methods, adaptive finite element methods, and applications.

  8. Computational chaos in massively parallel neural networks (United States)

    Barhen, Jacob; Gulati, Sandeep


    A fundamental issue which directly impacts the scalability of current theoretical neural network models to massively parallel embodiments, in both software as well as hardware, is the inherent and unavoidable concurrent asynchronicity of emerging fine-grained computational ensembles and the possible emergence of chaotic manifestations. Previous analyses attributed dynamical instability to the topology of the interconnection matrix, to parasitic components or to propagation delays. However, researchers have observed the existence of emergent computational chaos in a concurrently asynchronous framework, independent of the network topology. Researcher present a methodology enabling the effective asynchronous operation of large-scale neural networks. Necessary and sufficient conditions guaranteeing concurrent asynchronous convergence are established in terms of contracting operators. Lyapunov exponents are computed formally to characterize the underlying nonlinear dynamics. Simulation results are presented to illustrate network convergence to the correct results, even in the presence of large delays.

  9. Parallel Algorithms and Patterns

    Energy Technology Data Exchange (ETDEWEB)

    Robey, Robert W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)


    This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.

  10. PCLIPS: Parallel CLIPS (United States)

    Hall, Lawrence O.; Bennett, Bonnie H.; Tello, Ivan


    A parallel version of CLIPS 5.1 has been developed to run on Intel Hypercubes. The user interface is the same as that for CLIPS with some added commands to allow for parallel calls. A complete version of CLIPS runs on each node of the hypercube. The system has been instrumented to display the time spent in the match, recognize, and act cycles on each node. Only rule-level parallelism is supported. Parallel commands enable the assertion and retraction of facts to/from remote nodes working memory. Parallel CLIPS was used to implement a knowledge-based command, control, communications, and intelligence (C(sup 3)I) system to demonstrate the fusion of high-level, disparate sources. We discuss the nature of the information fusion problem, our approach, and implementation. Parallel CLIPS has also be used to run several benchmark parallel knowledge bases such as one to set up a cafeteria. Results show from running Parallel CLIPS with parallel knowledge base partitions indicate that significant speed increases, including superlinear in some cases, are possible.

  11. Parallel MR imaging. (United States)

    Deshmane, Anagha; Gulani, Vikas; Griswold, Mark A; Seiberlich, Nicole


    Parallel imaging is a robust method for accelerating the acquisition of magnetic resonance imaging (MRI) data, and has made possible many new applications of MR imaging. Parallel imaging works by acquiring a reduced amount of k-space data with an array of receiver coils. These undersampled data can be acquired more quickly, but the undersampling leads to aliased images. One of several parallel imaging algorithms can then be used to reconstruct artifact-free images from either the aliased images (SENSE-type reconstruction) or from the undersampled data (GRAPPA-type reconstruction). The advantages of parallel imaging in a clinical setting include faster image acquisition, which can be used, for instance, to shorten breath-hold times resulting in fewer motion-corrupted examinations. In this article the basic concepts behind parallel imaging are introduced. The relationship between undersampling and aliasing is discussed and two commonly used parallel imaging methods, SENSE and GRAPPA, are explained in detail. Examples of artifacts arising from parallel imaging are shown and ways to detect and mitigate these artifacts are described. Finally, several current applications of parallel imaging are presented and recent advancements and promising research in parallel imaging are briefly reviewed. Copyright © 2012 Wiley Periodicals, Inc.

  12. Oracle database performance and scalability a quantitative approach

    CERN Document Server

    Liu, Henry H


    A data-driven, fact-based, quantitative text on Oracle performance and scalability With database concepts and theories clearly explained in Oracle's context, readers quickly learn how to fully leverage Oracle's performance and scalability capabilities at every stage of designing and developing an Oracle-based enterprise application. The book is based on the author's more than ten years of experience working with Oracle, and is filled with dependable, tested, and proven performance optimization techniques. Oracle Database Performance and Scalability is divided into four parts that enable reader

  13. A scalable geometric multigrid solver for nonsymmetric elliptic systems with application to variable-density flows (United States)

    Esmaily, M.; Jofre, L.; Mani, A.; Iaccarino, G.


    A geometric multigrid algorithm is introduced for solving nonsymmetric linear systems resulting from the discretization of the variable density Navier-Stokes equations on nonuniform structured rectilinear grids and high-Reynolds number flows. The restriction operation is defined such that the resulting system on the coarser grids is symmetric, thereby allowing for the use of efficient smoother algorithms. To achieve an optimal rate of convergence, the sequence of interpolation and restriction operations are determined through a dynamic procedure. A parallel partitioning strategy is introduced to minimize communication while maintaining the load balance between all processors. To test the proposed algorithm, we consider two cases: 1) homogeneous isotropic turbulence discretized on uniform grids and 2) turbulent duct flow discretized on stretched grids. Testing the algorithm on systems with up to a billion unknowns shows that the cost varies linearly with the number of unknowns. This O (N) behavior confirms the robustness of the proposed multigrid method regarding ill-conditioning of large systems characteristic of multiscale high-Reynolds number turbulent flows. The robustness of our method to density variations is established by considering cases where density varies sharply in space by a factor of up to 104, showing its applicability to two-phase flow problems. Strong and weak scalability studies are carried out, employing up to 30,000 processors, to examine the parallel performance of our implementation. Excellent scalability of our solver is shown for a granularity as low as 104 to 105 unknowns per processor. At its tested peak throughput, it solves approximately 4 billion unknowns per second employing over 16,000 processors with a parallel efficiency higher than 50%.

  14. Parallel iterative solvers and preconditioners using approximate hierarchical methods

    Energy Technology Data Exchange (ETDEWEB)

    Grama, A.; Kumar, V.; Sameh, A. [Univ. of Minnesota, Minneapolis, MN (United States)


    In this paper, we report results of the performance, convergence, and accuracy of a parallel GMRES solver for Boundary Element Methods. The solver uses a hierarchical approximate matrix-vector product based on a hybrid Barnes-Hut / Fast Multipole Method. We study the impact of various accuracy parameters on the convergence and show that with minimal loss in accuracy, our solver yields significant speedups. We demonstrate the excellent parallel efficiency and scalability of our solver. The combined speedups from approximation and parallelism represent an improvement of several orders in solution time. We also develop fast and paralellizable preconditioners for this problem. We report on the performance of an inner-outer scheme and a preconditioner based on truncated Green`s function. Experimental results on a 256 processor Cray T3D are presented.

  15. Engineering-Based Thermal CFD Simulations on Massive Parallel Systems

    KAUST Repository

    Frisch, Jérôme


    The development of parallel Computational Fluid Dynamics (CFD) codes is a challenging task that entails efficient parallelization concepts and strategies in order to achieve good scalability values when running those codes on modern supercomputers with several thousands to millions of cores. In this paper, we present a hierarchical data structure for massive parallel computations that supports the coupling of a Navier–Stokes-based fluid flow code with the Boussinesq approximation in order to address complex thermal scenarios for energy-related assessments. The newly designed data structure is specifically designed with the idea of interactive data exploration and visualization during runtime of the simulation code; a major shortcoming of traditional high-performance computing (HPC) simulation codes. We further show and discuss speed-up values obtained on one of Germany’s top-ranked supercomputers with up to 140,000 processes and present simulation results for different engineering-based thermal problems.

  16. Fundamental Parallel Algorithms for Private-Cache Chip Multiprocessors

    DEFF Research Database (Denmark)

    Arge, Lars Allan; Goodrich, Michael T.; Nelson, Michael


    In this paper, we study parallel algorithms for private-cache chip multiprocessors (CMPs), focusing on methods for foundational problems that are scalable with the number of cores. By focusing on private-cache CMPs, we show that we can design efficient algorithms that need no additional assumptions...... about the way cores are interconnected, for we assume that all inter-processor communication occurs through the memory hierarchy. We study several fundamental problems, including prefix sums, selection, and sorting, which often form the building blocks of other parallel algorithms. Indeed, we present...... two sorting algorithms, a distribution sort and a mergesort. Our algorithms are asymptotically optimal in terms of parallel cache accesses and space complexity under reasonable assumptions about the relationships between the number of processors, the size of memory, and the size of cache blocks...

  17. Parallel alternating direction preconditioner for isogeometric simulations of explicit dynamics

    KAUST Repository

    Łoś, Marcin


    In this paper we present a parallel implementation of the alternating direction preconditioner for isogeometric simulations of explicit dynamics. The Alternating Direction Implicit (ADI) algorithm, belongs to the category of matrix-splitting iterative methods, was proposed almost six decades ago for solving parabolic and elliptic partial differential equations, see [1–4]. The new version of this algorithm has been recently developed for isogeometric simulations of two dimensional explicit dynamics [5] and steady-state diffusion equations with orthotropic heterogenous coefficients [6]. In this paper we present a parallel version of the alternating direction implicit algorithm for three dimensional simulations. The algorithm has been incorporated as a part of PETIGA an isogeometric framework [7] build on top of PETSc [8]. We show the scalability of the parallel algorithm on STAMPEDE linux cluster up to 10,000 processors, as well as the convergence rate of the PCG solver with ADI algorithm as preconditioner.

  18. Building Scalable Knowledge Graphs for Earth Science (United States)

    Ramachandran, R.; Maskey, M.; Gatlin, P. N.; Zhang, J.; Duan, X.; Bugbee, K.; Christopher, S. A.; Miller, J. J.


    Estimates indicate that the world's information will grow by 800% in the next five years. In any given field, a single researcher or a team of researchers cannot keep up with this rate of knowledge expansion without the help of cognitive systems. Cognitive computing, defined as the use of information technology to augment human cognition, can help tackle large systemic problems. Knowledge graphs, one of the foundational components of cognitive systems, link key entities in a specific domain with other entities via relationships. Researchers could mine these graphs to make probabilistic recommendations and to infer new knowledge. At this point, however, there is a dearth of tools to generate scalable Knowledge graphs using existing corpus of scientific literature for Earth science research. Our project is currently developing an end-to-end automated methodology for incrementally constructing Knowledge graphs for Earth Science. Semantic Entity Recognition (SER) is one of the key steps in this methodology. SER for Earth Science uses external resources (including metadata catalogs and controlled vocabulary) as references to guide entity extraction and recognition (i.e., labeling) from unstructured text, in order to build a large training set to seed the subsequent auto-learning component in our algorithm. Results from several SER experiments will be presented as well as lessons learned.

  19. Scalable Notch Antenna System for Multiport Applications

    Directory of Open Access Journals (Sweden)

    Abdurrahim Toktas


    Full Text Available A novel and compact scalable antenna system is designed for multiport applications. The basic design is built on a square patch with an electrical size of 0.82λ0×0.82λ0 (at 2.4 GHz on a dielectric substrate. The design consists of four symmetrical and orthogonal triangular notches with circular feeding slots at the corners of the common patch. The 4-port antenna can be simply rearranged to 8-port and 12-port systems. The operating band of the system can be tuned by scaling (S the size of the system while fixing the thickness of the substrate. The antenna system with S: 1/1 in size of 103.5×103.5 mm2 operates at the frequency band of 2.3–3.0 GHz. By scaling the antenna with S: 1/2.3, a system of 45×45 mm2 is achieved, and thus the operating band is tuned to 4.7–6.1 GHz with the same scattering characteristic. A parametric study is also conducted to investigate the effects of changing the notch dimensions. The performance of the antenna is verified in terms of the antenna characteristics as well as diversity and multiplexing parameters. The antenna system can be tuned by scaling so that it is applicable to the multiport WLAN, WIMAX, and LTE devices with port upgradability.

  20. Scalability and interoperability within glideinWMS

    International Nuclear Information System (INIS)

    Bradley, D.; Sfiligoi, I.; Padhi, S.; Frey, J.; Tannenbaum, T.


    Physicists have access to thousands of CPUs in grid federations such as OSG and EGEE. With the start-up of the LHC, it is essential for individuals or groups of users to wrap together available resources from multiple sites across multiple grids under a higher user-controlled layer in order to provide a homogeneous pool of available resources. One such system is glideinWMS, which is based on the Condor batch system. A general discussion of glideinWMS can be found elsewhere. Here, we focus on recent advances in extending its reach: scalability and integration of heterogeneous compute elements. We demonstrate that the new developments exceed the design goal of over 10,000 simultaneous running jobs under a single Condor schedd, using strong security protocols across global networks, and sustaining a steady-state job completion rate of a few Hz. We also show interoperability across heterogeneous computing elements achieved using client-side methods. We discuss this technique and the challenges in direct access to NorduGrid and CREAM compute elements, in addition to Globus based systems.

  1. A Programmable, Scalable-Throughput Interleaver

    Directory of Open Access Journals (Sweden)

    Rijshouwer EJC


    Full Text Available The interleaver stages of digital communication standards show a surprisingly large variation in throughput, state sizes, and permutation functions. Furthermore, data rates for 4G standards such as LTE-Advanced will exceed typical baseband clock frequencies of handheld devices. Multistream operation for Software Defined Radio and iterative decoding algorithms will call for ever higher interleave data rates. Our interleave machine is built around 8 single-port SRAM banks and can be programmed to generate up to 8 addresses every clock cycle. The scalable architecture combines SIMD and VLIW concepts with an efficient resolution of bank conflicts. A wide range of cellular, connectivity, and broadcast interleavers have been mapped on this machine, with throughputs up to more than 0.5 Gsymbol/second. Although it was designed for channel interleaving, the application domain of the interleaver extends also to Turbo interleaving. The presented configuration of the architecture is designed as a part of a programmable outer receiver on a prototype board. It offers (near universal programmability to enable the implementation of new interleavers. The interleaver measures 2.09 m in 65 nm CMOS (including memories and proves functional on silicon.

  2. Scalable Combinatorial Tools for Health Disparities Research

    Directory of Open Access Journals (Sweden)

    Michael A. Langston


    Full Text Available Despite staggering investments made in unraveling the human genome, current estimates suggest that as much as 90% of the variance in cancer and chronic diseases can be attributed to factors outside an individual’s genetic endowment, particularly to environmental exposures experienced across his or her life course. New analytical approaches are clearly required as investigators turn to complicated systems theory and ecological, place-based and life-history perspectives in order to understand more clearly the relationships between social determinants, environmental exposures and health disparities. While traditional data analysis techniques remain foundational to health disparities research, they are easily overwhelmed by the ever-increasing size and heterogeneity of available data needed to illuminate latent gene x environment interactions. This has prompted the adaptation and application of scalable combinatorial methods, many from genome science research, to the study of population health. Most of these powerful tools are algorithmically sophisticated, highly automated and mathematically abstract. Their utility motivates the main theme of this paper, which is to describe real applications of innovative transdisciplinary models and analyses in an effort to help move the research community closer toward identifying the causal mechanisms and associated environmental contexts underlying health disparities. The public health exposome is used as a contemporary focus for addressing the complex nature of this subject.

  3. Scalability of Direct Solver for Non-stationary Cahn-Hilliard Simulations with Linearized time Integration Scheme

    KAUST Repository

    Woźniak, M.


    We study the features of a new mixed integration scheme dedicated to solving the non-stationary variational problems. The scheme is composed of the FEM approximation with respect to the space variable coupled with a 3-leveled time integration scheme with a linearized right-hand side operator. It was applied in solving the Cahn-Hilliard parabolic equation with a nonlinear, fourth-order elliptic part. The second order of the approximation along the time variable was proven. Moreover, the good scalability of the software based on this scheme was confirmed during simulations. We verify the proposed time integration scheme by monitoring the Ginzburg-Landau free energy. The numerical simulations are performed by using a parallel multi-frontal direct solver executed over STAMPEDE Linux cluster. Its scalability was compared to the results of the three direct solvers, including MUMPS, SuperLU and PaSTiX.

  4. Scalable air cathode microbial fuel cells using glass fiber separators, plastic mesh supporters, and graphite fiber brush anodes

    KAUST Repository

    Zhang, Xiaoyuan


    The combined use of brush anodes and glass fiber (GF1) separators, and plastic mesh supporters were used here for the first time to create a scalable microbial fuel cell architecture. Separators prevented short circuiting of closely-spaced electrodes, and cathode supporters were used to avoid water gaps between the separator and cathode that can reduce power production. The maximum power density with a separator and supporter and a single cathode was 75±1W/m3. Removing the separator decreased power by 8%. Adding a second cathode increased power to 154±1W/m3. Current was increased by connecting two MFCs connected in parallel. These results show that brush anodes, combined with a glass fiber separator and a plastic mesh supporter, produce a useful MFC architecture that is inherently scalable due to good insulation between the electrodes and a compact architecture. © 2010 Elsevier Ltd.

  5. A Novel Coarsening Method for Scalable and Efficient Mesh Generation

    Energy Technology Data Exchange (ETDEWEB)

    Yoo, A; Hysom, D; Gunney, B


    matrix-vector multiplication can be performed locally on each processor and hence to minimize communication. Furthermore, a good graph partitioning scheme ensures the equal amount of computation performed on each processor. Graph partitioning is a well known NP-complete problem, and thus the most commonly used graph partitioning algorithms employ some forms of heuristics. These algorithms vary in terms of their complexity, partition generation time, and the quality of partitions, and they tend to trade off these factors. A significant challenge we are currently facing at the Lawrence Livermore National Laboratory is how to partition very large meshes on massive-size distributed memory machines like IBM BlueGene/P, where scalability becomes a big issue. For example, we have found that the ParMetis, a very popular graph partitioning tool, can only scale to 16K processors. An ideal graph partitioning method on such an environment should be fast and scale to very large meshes, while producing high quality partitions. This is an extremely challenging task, as to scale to that level, the partitioning algorithm should be simple and be able to produce partitions that minimize inter-processor communications and balance the load imposed on the processors. Our goals in this work are two-fold: (1) To develop a new scalable graph partitioning method with good load balancing and communication reduction capability. (2) To study the performance of the proposed partitioning method on very large parallel machines using actual data sets and compare the performance to that of existing methods. The proposed method achieves the desired scalability by reducing the mesh size. For this, it coarsens an input mesh into a smaller size mesh by coalescing the vertices and edges of the original mesh into a set of mega-vertices and mega-edges. A new coarsening method called brick algorithm is developed in this research. In the brick algorithm, the zones in a given mesh are first grouped into fixed size

  6. Parallel reservoir simulator computations

    International Nuclear Information System (INIS)

    Hemanth-Kumar, K.; Young, L.C.


    The adaptation of a reservoir simulator for parallel computations is described. The simulator was originally designed for vector processors. It performs approximately 99% of its calculations in vector/parallel mode and relative to scalar calculations it achieves speedups of 65 and 81 for black oil and EOS simulations, respectively on the CRAY C-90

  7. Systematic and Scalable Testing of Concurrent Programs (United States)


    Principles (SOSP ’07), pages 103–116, New York, NY, USA, 2007. ACM. [Cited on page 86.] [97] Shan Lu, Soyeon Park, Eunsoo Seo , and Yuanyuan html . [Cited on page 80.] [105] Madanlal Musuvathi, David Y. W. Park, Andy Chou, Dawson R. Engler, and David L. Dill. CMC: A...Parallel File Scanner [online]. 2013. URL: [Cited on page 80.] [114] Chang- Seo Park and Koushik Sen. Randomized Active

  8. Parallel computing works

    Energy Technology Data Exchange (ETDEWEB)


    An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.

  9. Totally parallel multilevel algorithms (United States)

    Frederickson, Paul O.


    Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.

  10. Heterogeneous scalable framework for multiphase flows

    Energy Technology Data Exchange (ETDEWEB)

    Morris, Karla Vanessa [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)


    Two categories of challenges confront the developer of computational spray models: those related to the computation and those related to the physics. Regarding the computation, the trend towards heterogeneous, multi- and many-core platforms will require considerable re-engineering of codes written for the current supercomputing platforms. Regarding the physics, accurate methods for transferring mass, momentum and energy from the dispersed phase onto the carrier fluid grid have so far eluded modelers. Significant challenges also lie at the intersection between these two categories. To be competitive, any physics model must be expressible in a parallel algorithm that performs well on evolving computer platforms. This work created an application based on a software architecture where the physics and software concerns are separated in a way that adds flexibility to both. The develop spray-tracking package includes an application programming interface (API) that abstracts away the platform-dependent parallelization concerns, enabling the scientific programmer to write serial code that the API resolves into parallel processes and threads of execution. The project also developed the infrastructure required to provide similar APIs to other application. The API allow object-oriented Fortran applications direct interaction with Trilinos to support memory management of distributed objects in central processing units (CPU) and graphic processing units (GPU) nodes for applications using C++.

  11. Scalable IC Platform for Smart Cameras

    Directory of Open Access Journals (Sweden)

    Harry Broers


    Full Text Available Smart cameras are among the emerging new fields of electronics. The points of interest are in the application areas, software and IC development. In order to reduce cost, it is worthwhile to invest in a single architecture that can be scaled for the various application areas in performance (and resulting power consumption. In this paper, we show that the combination of an SIMD (single-instruction multiple-data processor and a general-purpose DSP is very advantageous for the image processing tasks encountered in smart cameras. While the SIMD processor gives the very high performance necessary by exploiting the inherent data parallelism found in the pixel crunching part of the algorithms, the DSP offers a friendly approach to the more complex tasks. The paper continues to motivate that SIMD processors have very convenient scaling properties in silicon, making the complete, SIMD-DSP architecture suitable for different application areas without changing the software suite. Analysis of the changes in power consumption due to scaling shows that for typical image processing tasks, it is beneficial to scale the SIMD processor to use the maximum level of parallelism available in the algorithm if the IC supply voltage can be lowered. If silicon cost is of importance, the parallelism of the processor should be scaled to just reach the desired performance given the speed of the silicon.

  12. A parallel algorithm for 3D particle tracking and Lagrangian trajectory reconstruction

    International Nuclear Information System (INIS)

    Barker, Douglas; Zhang, Yuanhui; Lifflander, Jonathan; Arya, Anshu


    Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. This problem may be solved by using parallel processing. In this paper, a parallel-processing framework has been developed based on frame decomposition and is programmed using the asynchronous object-oriented Charm++ paradigm. This framework can be a key step in achieving a scalable Lagrangian measurement system for particle-tracking velocimetry and may lead to real-time measurement capabilities. The parallel tracking algorithm was evaluated with three data sets including the particle image velocimetry standard 3D images data set #352, a uniform data set for optimal parallel performance and a computational-fluid-dynamics-generated non-uniform data set to test trajectory reconstruction accuracy, consistency with the sequential version and scalability to more than 500 processors. The algorithm showed strong scaling up to 512 processors and no inherent limits of scalability were seen. Ultimately, up to a 200-fold speedup is observed compared to the serial algorithm when 256 processors were used. The parallel algorithm is adaptable and could be easily modified to use any sequential tracking algorithm, which inputs frames of 3D particle location data and outputs particle trajectories

  13. A massively parallel GPU-accelerated model for analysis of fully nonlinear free surface waves

    DEFF Research Database (Denmark)

    Engsig-Karup, Allan Peter; Madsen, Morten G.; Glimberg, Stefan Lemvig


    We implement and evaluate a massively parallel and scalable algorithm based on a multigrid preconditioned Defect Correction method for the simulation of fully nonlinear free surface flows. The simulations are based on a potential model that describes wave propagation over uneven bottoms in three...... space dimensions and is useful for fast analysis and prediction purposes in coastal and offshore engineering. A dedicated numerical model based on the proposed algorithm is executed in parallel by utilizing affordable modern special purpose graphics processing unit (GPU). The model is based on a low......-storage flexible-order accurate finite difference method that is known to be efficient and scalable on a CPU core (single thread). To achieve parallel performance of the relatively complex numerical model, we investigate a new trend in high-performance computing where many-core GPUs are utilized as high...

  14. Scalable Concurrency Control and Recovery for Shared Storage Arrays

    National Research Council Canada - National Science Library

    Amiri, Khalil


    Shared storage arrays enable thousands of storage devices to be shared and directly accessed by end hosts over switched system area networks, promising databases and file systems highly scalable, reliable storage...

  15. Scalable Partitioning Algorithms for FPGAs With Heterogeneous Resources

    National Research Council Canada - National Science Library

    Selvakkumaran, Navaratnasothie; Ranjan, Abhishek; Raje, Salil; Karypis, George


    As FPGA densities increase, partitioning-based FPGA placement approaches are becoming increasingly important as they can be used to provide high-quality and computationally scalable placement solutions...

  16. Supporting scalable Bayesian networks using configurable discretizer actuators

    CSIR Research Space (South Africa)

    Osunmakinde, I


    Full Text Available memory management schemes to provide a complete scalable basis for the optimization strategy. This prevents the limited memory from halting while minimizing the discretization time and adapting new observations without re-scanning the entire old data...

  17. PSOM2—partitioning-based scalable ontology matching using ...

    Indian Academy of Sciences (India)

    B Sathiya


    -based systems to reduce the matching space. ... reduction in execution time, leading to an effective and scalable ontology matching system. Keywords. ... ontology matching results, collaborative and social ontol- ogy matching ...

  18. ARC Code TI: Block-GP: Scalable Gaussian Process Regression (United States)

    National Aeronautics and Space Administration — Block GP is a Gaussian Process regression framework for multimodal data, that can be an order of magnitude more scalable than existing state-of-the-art nonlinear...

  19. Scalable Quantum Networks for Distributed Computing and Sensing (United States)


    AFRL-AFOSR-UK-TR-2016-0007 Scalable Quantum Networks for Distributed Computing and Sensing Ian Walmsley THE UNIVERSITY OF OXFORD Final Report 04/01...photon. 15. SUBJECT TERMS EOARD, quantum information processing, quantum computation , photonics, quantum networks, quantum memory 16. ANSI Std. Z39.18 Final report for “Scalable Quantum Networks for Distributed Computing and Sensing” Project 12-2076; Sept 2012 through Aug 2015

  20. SOL: A Library for Scalable Online Learning Algorithms


    Wu, Yue; Hoi, Steven C. H.; Liu, Chenghao; Lu, Jing; Sahoo, Doyen; Yu, Nenghai


    SOL is an open-source library for scalable online learning algorithms, and is particularly suitable for learning with high-dimensional data. The library provides a family of regular and sparse online learning algorithms for large-scale binary and multi-class classification tasks with high efficiency, scalability, portability, and extensibility. SOL was implemented in C++, and provided with a collection of easy-to-use command-line tools, python wrappers and library calls for users and develope...

  1. TriG: Next Generation Scalable Spaceborne GNSS Receiver (United States)

    Tien, Jeffrey Y.; Okihiro, Brian Bachman; Esterhuizen, Stephan X.; Franklin, Garth W.; Meehan, Thomas K.; Munson, Timothy N.; Robison, David E.; Turbiner, Dmitry; Young, Lawrence E.


    TriG is the next generation NASA scalable space GNSS Science Receiver. It will track all GNSS and additional signals (i.e. GPS, GLONASS, Galileo, Compass and Doris). Scalable 3U architecture and fully software and firmware recofigurable, enabling optimization to meet specific mission requirements. TriG GNSS EM is currently undergoing testing and is expected to complete full performance testing later this year.

  2. Nested Parallelism: Allocation of Threads to Tasks and OpenMP Implementation

    Directory of Open Access Journals (Sweden)

    Ragnhild Blikberg


    Full Text Available In this paper we discuss the use of nested parallelism. Our claim is that if the problem naturally possesses multiple levels of parallelism, then applying parallelism to all levels may significantly enhance the scalability of your algorithm. This claim is sustained by numerical experiments. We also discuss how to implement multi-level parallelism using OpenMP. We find current OpenMP implementation, based on version 1.0, to have severe limitation for implementing nested parallelization. We then show how this can be circumvented by explicitly assign task to threads. Load balancing issues become more complicated with two (or more levels of parallelism. To handle this problem, we have designed a distribution algorithm which groups threads into teams, each team being responsible for one course grain outer-level task. This algorithm is proven to produce the optimal load balance, under given assumptions.

  3. Responsive, Flexible and Scalable Broader Impacts (Invited) (United States)

    Decharon, A.; Companion, C.; Steinman, M.


    In many educator professional development workshops, scientists present content in a slideshow-type format and field questions afterwards. Drawbacks of this approach include: inability to begin the lecture with content that is responsive to audience needs; lack of flexible access to specific material within the linear presentation; and “Q&A” sessions are not easily scalable to broader audiences. Often this type of traditional interaction provides little direct benefit to the scientists. The Centers for Ocean Sciences Education Excellence - Ocean Systems (COSEE-OS) applies the technique of concept mapping with demonstrated effectiveness in helping scientists and educators “get on the same page” (deCharon et al., 2009). A key aspect is scientist professional development geared towards improving face-to-face and online communication with non-scientists. COSEE-OS promotes scientist-educator collaboration, tests the application of scientist-educator maps in new contexts through webinars, and is piloting the expansion of maps as long-lived resources for the broader community. Collaboration - COSEE-OS has developed and tested a workshop model bringing scientists and educators together in a peer-oriented process, often clarifying common misconceptions. Scientist-educator teams develop online concept maps that are hyperlinked to “assets” (i.e., images, videos, news) and are responsive to the needs of non-scientist audiences. In workshop evaluations, 91% of educators said that the process of concept mapping helped them think through science topics and 89% said that concept mapping helped build a bridge of communication with scientists (n=53). Application - After developing a concept map, with COSEE-OS staff assistance, scientists are invited to give webinar presentations that include live “Q&A” sessions. The webinars extend the reach of scientist-created concept maps to new contexts, both geographically and topically (e.g., oil spill), with a relatively small

  4. Microscopic Characterization of Scalable Coherent Rydberg Superatoms (United States)

    Zeiher, Johannes; Schauß, Peter; Hild, Sebastian; Macrı, Tommaso; Bloch, Immanuel; Gross, Christian


    Strong interactions can amplify quantum effects such that they become important on macroscopic scales. Controlling these coherently on a single-particle level is essential for the tailored preparation of strongly correlated quantum systems and opens up new prospects for quantum technologies. Rydberg atoms offer such strong interactions, which lead to extreme nonlinearities in laser-coupled atomic ensembles. As a result, multiple excitation of a micrometer-sized cloud can be blocked while the light-matter coupling becomes collectively enhanced. The resulting two-level system, often called a "superatom," is a valuable resource for quantum information, providing a collective qubit. Here, we report on the preparation of 2 orders of magnitude scalable superatoms utilizing the large interaction strength provided by Rydberg atoms combined with precise control of an ensemble of ultracold atoms in an optical lattice. The latter is achieved with sub-shot-noise precision by local manipulation of a two-dimensional Mott insulator. We microscopically confirm the superatom picture by in situ detection of the Rydberg excitations and observe the characteristic square-root scaling of the optical coupling with the number of atoms. Enabled by the full control over the atomic sample, including the motional degrees of freedom, we infer the overlap of the produced many-body state with a W state from the observed Rabi oscillations and deduce the presence of entanglement. Finally, we investigate the breakdown of the superatom picture when two Rydberg excitations are present in the system, which leads to dephasing and a loss of coherence.

  5. Microscopic Characterization of Scalable Coherent Rydberg Superatoms

    Directory of Open Access Journals (Sweden)

    Johannes Zeiher


    Full Text Available Strong interactions can amplify quantum effects such that they become important on macroscopic scales. Controlling these coherently on a single-particle level is essential for the tailored preparation of strongly correlated quantum systems and opens up new prospects for quantum technologies. Rydberg atoms offer such strong interactions, which lead to extreme nonlinearities in laser-coupled atomic ensembles. As a result, multiple excitation of a micrometer-sized cloud can be blocked while the light-matter coupling becomes collectively enhanced. The resulting two-level system, often called a “superatom,” is a valuable resource for quantum information, providing a collective qubit. Here, we report on the preparation of 2 orders of magnitude scalable superatoms utilizing the large interaction strength provided by Rydberg atoms combined with precise control of an ensemble of ultracold atoms in an optical lattice. The latter is achieved with sub-shot-noise precision by local manipulation of a two-dimensional Mott insulator. We microscopically confirm the superatom picture by in situ detection of the Rydberg excitations and observe the characteristic square-root scaling of the optical coupling with the number of atoms. Enabled by the full control over the atomic sample, including the motional degrees of freedom, we infer the overlap of the produced many-body state with a W state from the observed Rabi oscillations and deduce the presence of entanglement. Finally, we investigate the breakdown of the superatom picture when two Rydberg excitations are present in the system, which leads to dephasing and a loss of coherence.

  6. Scalable persistent identifier systems for dynamic datasets (United States)

    Golodoniuc, P.; Cox, S. J. D.; Klump, J. F.


    Reliable and persistent identification of objects, whether tangible or not, is essential in information management. Many Internet-based systems have been developed to identify digital data objects, e.g., PURL, LSID, Handle, ARK. These were largely designed for identification of static digital objects. The amount of data made available online has grown exponentially over the last two decades and fine-grained identification of dynamically generated data objects within large datasets using conventional systems (e.g., PURL) has become impractical. We have compared capabilities of various technological solutions to enable resolvability of data objects in dynamic datasets, and developed a dataset-centric approach to resolution of identifiers. This is particularly important in Semantic Linked Data environments where dynamic frequently changing data is delivered live via web services, so registration of individual data objects to obtain identifiers is impractical. We use identifier patterns and pattern hierarchies for identification of data objects, which allows relationships between identifiers to be expressed, and also provides means for resolving a single identifier into multiple forms (i.e. views or representations of an object). The latter can be implemented through (a) HTTP content negotiation, or (b) use of URI querystring parameters. The pattern and hierarchy approach has been implemented in the Linked Data API supporting the United Nations Spatial Data Infrastructure (UNSDI) initiative and later in the implementation of geoscientific data delivery for the Capricorn Distal Footprints project using International Geo Sample Numbers (IGSN). This enables flexible resolution of multi-view persistent identifiers and provides a scalable solution for large heterogeneous datasets.

  7. A Scalable Version of the Navy Operational Global Atmospheric Prediction System Spectral Forecast Model

    Directory of Open Access Journals (Sweden)

    Thomas E. Rosmond


    Full Text Available The Navy Operational Global Atmospheric Prediction System (NOGAPS includes a state-of-the-art spectral forecast model similar to models run at several major operational numerical weather prediction (NWP centers around the world. The model, developed by the Naval Research Laboratory (NRL in Monterey, California, has run operational at the Fleet Numerical Meteorological and Oceanographic Center (FNMOC since 1982, and most recently is being run on a Cray C90 in a multi-tasked configuration. Typically the multi-tasked code runs on 10 to 15 processors with overall parallel efficiency of about 90%. resolution is T159L30, but other operational and research applications run at significantly lower resolutions. A scalable NOGAPS forecast model has been developed by NRL in anticipation of a FNMOC C90 replacement in about 2001, as well as for current NOGAPS research requirements to run on DOD High-Performance Computing (HPC scalable systems. The model is designed to run with message passing (MPI. Model design criteria include bit reproducibility for different processor numbers and reasonably efficient performance on fully shared memory, distributed memory, and distributed shared memory systems for a wide range of model resolutions. Results for a wide range of processor numbers, model resolutions, and different vendor architectures are presented. Single node performance has been disappointing on RISC based systems, at least compared to vector processor performance. This is a common complaint, and will require careful re-examination of traditional numerical weather prediction (NWP model software design and data organization to fully exploit future scalable architectures.

  8. Integration experiments and performance studies of a COTS parallel archive system

    Energy Technology Data Exchange (ETDEWEB)

    Chen, Hsing-bung [Los Alamos National Laboratory; Scott, Cody [Los Alamos National Laboratory; Grider, Gary [Los Alamos National Laboratory; Torres, Aaron [Los Alamos National Laboratory; Turley, Milton [Los Alamos National Laboratory; Sanchez, Kathy [Los Alamos National Laboratory; Bremer, John [Los Alamos National Laboratory


    Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address

  9. Integration experiences and performance studies of A COTS parallel archive systems

    Energy Technology Data Exchange (ETDEWEB)

    Chen, Hsing-bung [Los Alamos National Laboratory; Scott, Cody [Los Alamos National Laboratory; Grider, Bary [Los Alamos National Laboratory; Torres, Aaron [Los Alamos National Laboratory; Turley, Milton [Los Alamos National Laboratory; Sanchez, Kathy [Los Alamos National Laboratory; Bremer, John [Los Alamos National Laboratory


    Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of

  10. Scalable conditional induction variables (CIV) analysis

    DEFF Research Database (Denmark)

    Oancea, Cosmin Eugen; Rauchwerger, Lawrence


    Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as filter, or stack operations and pose significant challenges to automatic parallelization. Because...... representation. Our technique requires no modifications of our dependence tests, which is agnostic to the original shape of the subscripts, and is more powerful than previously reported dependence tests that rely on the pairwise disambiguation of read-write references. We have implemented the CIV analysis in our...

  11. A parallel solution for high resolution histological image analysis. (United States)

    Bueno, G; González, R; Déniz, O; García-Rojo, M; González-García, J; Fernández-Carrobles, M M; Vállez, N; Salido, J


    This paper describes a general methodology for developing parallel image processing algorithms based on message passing for high resolution images (on the order of several Gigabytes). These algorithms have been applied to histological images and must be executed on massively parallel processing architectures. Advances in new technologies for complete slide digitalization in pathology have been combined with developments in biomedical informatics. However, the efficient use of these digital slide systems is still a challenge. The image processing that these slides are subject to is still limited both in terms of data processed and processing methods. The work presented here focuses on the need to design and develop parallel image processing tools capable of obtaining and analyzing the entire gamut of information included in digital slides. Tools have been developed to assist pathologists in image analysis and diagnosis, and they cover low and high-level image processing methods applied to histological images. Code portability, reusability and scalability have been tested by using the following parallel computing architectures: distributed memory with massive parallel processors and two networks, INFINIBAND and Myrinet, composed of 17 and 1024 nodes respectively. The parallel framework proposed is flexible, high performance solution and it shows that the efficient processing of digital microscopic images is possible and may offer important benefits to pathology laboratories. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.

  12. Space-Filling Supercapacitor Carpets: Highly scalable fractal architecture for energy storage (United States)

    Tiliakos, Athanasios; Trefilov, Alexandra M. I.; Tanasӑ, Eugenia; Balan, Adriana; Stamatin, Ioan


    Revamping ground-breaking ideas from fractal geometry, we propose an alternative micro-supercapacitor configuration realized by laser-induced graphene (LIG) foams produced via laser pyrolysis of inexpensive commercial polymers. The Space-Filling Supercapacitor Carpet (SFSC) architecture introduces the concept of nested electrodes based on the pre-fractal Peano space-filling curve, arranged in a symmetrical equilateral setup that incorporates multiple parallel capacitor cells sharing common electrodes for maximum efficiency and optimal length-to-area distribution. We elucidate on the theoretical foundations of the SFSC architecture, and we introduce innovations (high-resolution vector-mode printing) in the LIG method that allow for the realization of flexible and scalable devices based on low iterations of the Peano algorithm. SFSCs exhibit distributed capacitance properties, leading to capacitance, energy, and power ratings proportional to the number of nested electrodes (up to 4.3 mF, 0.4 μWh, and 0.2 mW for the largest tested model of low iteration using aqueous electrolytes), with competitively high energy and power densities. This can pave the road for full scalability in energy storage, reaching beyond the scale of micro-supercapacitors for incorporating into larger and more demanding applications.

  13. Scalable electrophysiology in intact small animals with nanoscale suspended electrode arrays (United States)

    Gonzales, Daniel L.; Badhiwala, Krishna N.; Vercosa, Daniel G.; Avants, Benjamin W.; Liu, Zheng; Zhong, Weiwei; Robinson, Jacob T.


    Electrical measurements from large populations of animals would help reveal fundamental properties of the nervous system and neurological diseases. Small invertebrates are ideal for these large-scale studies; however, patch-clamp electrophysiology in microscopic animals typically requires invasive dissections and is low-throughput. To overcome these limitations, we present nano-SPEARs: suspended electrodes integrated into a scalable microfluidic device. Using this technology, we have made the first extracellular recordings of body-wall muscle electrophysiology inside an intact roundworm, Caenorhabditis elegans. We can also use nano-SPEARs to record from multiple animals in parallel and even from other species, such as Hydra littoralis. Furthermore, we use nano-SPEARs to establish the first electrophysiological phenotypes for C. elegans models for amyotrophic lateral sclerosis and Parkinson's disease, and show a partial rescue of the Parkinson's phenotype through drug treatment. These results demonstrate that nano-SPEARs provide the core technology for microchips that enable scalable, in vivo studies of neurobiology and neurological diseases.

  14. Novel flat datacenter network architecture based on scalable and flow-controlled optical switch system. (United States)

    Miao, Wang; Luo, Jun; Di Lucente, Stefano; Dorren, Harm; Calabretta, Nicola


    We propose and demonstrate an optical flat datacenter network based on scalable optical switch system with optical flow control. Modular structure with distributed control results in port-count independent optical switch reconfiguration time. RF tone in-band labeling technique allowing parallel processing of the label bits ensures the low latency operation regardless of the switch port-count. Hardware flow control is conducted at optical level by re-using the label wavelength without occupying extra bandwidth, space, and network resources which further improves the performance of latency within a simple structure. Dynamic switching including multicasting operation is validated for a 4 x 4 system. Error free operation of 40 Gb/s data packets has been achieved with only 1 dB penalty. The system could handle an input load up to 0.5 providing a packet loss lower that 10(-5) and an average latency less that 500 ns when a buffer size of 16 packets is employed. Investigation on scalability also indicates that the proposed system could potentially scale up to large port count with limited power penalty.

  15. Algorithms for parallel computers

    International Nuclear Information System (INIS)

    Churchhouse, R.F.


    Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)

  16. Parallelism and array processing

    International Nuclear Information System (INIS)

    Zacharov, V.


    Modern computing, as well as the historical development of computing, has been dominated by sequential monoprocessing. Yet there is the alternative of parallelism, where several processes may be in concurrent execution. This alternative is discussed in a series of lectures, in which the main developments involving parallelism are considered, both from the standpoint of computing systems and that of applications that can exploit such systems. The lectures seek to discuss parallelism in a historical context, and to identify all the main aspects of concurrency in computation right up to the present time. Included will be consideration of the important question as to what use parallelism might be in the field of data processing. (orig.)

  17. Scalable microbial fuel cell (MFC) stack for continuous real wastewater treatment. (United States)

    Zhuang, Li; Zheng, Yu; Zhou, Shungui; Yuan, Yong; Yuan, Haoran; Chen, Yong


    A tubular air-cathode microbial fuel cell (MFC) stack with high scalability and low material cost was constructed and the ability of simultaneous real wastewater treatment and bioelectricity generation was investigated under continuous flow mode. At the two organic loading rates (ORLs) tested (1.2 and 4.9kg COD/m(3)d), five non-Pt MFCs connected in series and parallel circuit modes treating swine wastewater can enable an increase of the voltage and the current. The parallel stack retained high power output and the series connection underwent energy loss due to the substrate cross-conduction effect. With continuous electricity production, the parallel stack achieved 83.8% of COD removal and 90.8% of NH(4)(+)-N removal at 1.2kg COD/m(3)d, and 77.1% COD removal and 80.7% NH(4)(+)-N removal at 4.9kg COD/m(3)d. The MFC stack system in this study was demonstrated to be able to treat real wastewater with the added benefit of harvesting electricity energy. Copyright © 2011 Elsevier Ltd. All rights reserved.

  18. Empirical study of parallel LRU simulation algorithms (United States)

    Carr, Eric; Nicol, David M.


    This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.

  19. Parallel magnetic resonance imaging

    International Nuclear Information System (INIS)

    Larkman, David J; Nunes, Rita G


    Parallel imaging has been the single biggest innovation in magnetic resonance imaging in the last decade. The use of multiple receiver coils to augment the time consuming Fourier encoding has reduced acquisition times significantly. This increase in speed comes at a time when other approaches to acquisition time reduction were reaching engineering and human limits. A brief summary of spatial encoding in MRI is followed by an introduction to the problem parallel imaging is designed to solve. There are a large number of parallel reconstruction algorithms; this article reviews a cross-section, SENSE, SMASH, g-SMASH and GRAPPA, selected to demonstrate the different approaches. Theoretical (the g-factor) and practical (coil design) limits to acquisition speed are reviewed. The practical implementation of parallel imaging is also discussed, in particular coil calibration. How to recognize potential failure modes and their associated artefacts are shown. Well-established applications including angiography, cardiac imaging and applications using echo planar imaging are reviewed and we discuss what makes a good application for parallel imaging. Finally, active research areas where parallel imaging is being used to improve data quality by repairing artefacted images are also reviewed. (invited topical review)

  20. Evaluating parallel relational databases for medical data analysis.

    Energy Technology Data Exchange (ETDEWEB)

    Rintoul, Mark Daniel; Wilson, Andrew T.


    Hospitals have always generated and consumed large amounts of data concerning patients, treatment and outcomes. As computers and networks have permeated the hospital environment it has become feasible to collect and organize all of this data. This raises naturally the question of how to deal with the resulting mountain of information. In this report we detail a proof-of-concept test using two commercially available parallel database systems to analyze a set of real, de-identified medical records. We examine database scalability as data sizes increase as well as responsiveness under load from multiple users.

  1. Parallel importance sampling in conditional linear Gaussian networks

    DEFF Research Database (Denmark)

    Salmerón, Antonio; Ramos-López, Darío; Borchani, Hanen


    In this paper we analyse the problem of probabilistic inference in CLG networks when evidence comes in streams. In such situations, fast and scalable algorithms, able to provide accurate responses in a short time are required. We consider the instantiation of variational inference and importance...... sampling, two well known tools for probabilistic inference, to the CLG case. The experimental results over synthetic networks show how a parallel version importance sampling, and more precisely evidence weighting, is a promising scheme, as it is accurate and scales up with respect to available computing...

  2. Protocol-Based Verification of Message-Passing Parallel Programs

    DEFF Research Database (Denmark)

    López-Acosta, Hugo-Andrés; Eduardo R. B. Marques, Eduardo R. B.; Martins, Francisco


    a protocol language based on a dependent type system for message-passing parallel programs, which includes various communication operators, such as point-to-point messages, broadcast, reduce, array scatter and gather. For the verification of a program against a given protocol, the protocol is first......, that suffer from the state-explosion problem or that otherwise depend on parameters to the program itself. We experimentally evaluated our approach against state-of-the-art tools for MPI to conclude that our approach offers a scalable solution....

  3. Scalable quantum information processing with photons and atoms (United States)

    Pan, Jian-Wei

    Over the past three decades, the promises of super-fast quantum computing and secure quantum cryptography have spurred a world-wide interest in quantum information, generating fascinating quantum technologies for coherent manipulation of individual quantum systems. However, the distance of fiber-based quantum communications is limited due to intrinsic fiber loss and decreasing of entanglement quality. Moreover, probabilistic single-photon source and entanglement source demand exponentially increased overheads for scalable quantum information processing. To overcome these problems, we are taking two paths in parallel: quantum repeaters and through satellite. We used the decoy-state QKD protocol to close the loophole of imperfect photon source, and used the measurement-device-independent QKD protocol to close the loophole of imperfect photon detectors--two main loopholes in quantum cryptograph. Based on these techniques, we are now building world's biggest quantum secure communication backbone, from Beijing to Shanghai, with a distance exceeding 2000 km. Meanwhile, we are developing practically useful quantum repeaters that combine entanglement swapping, entanglement purification, and quantum memory for the ultra-long distance quantum communication. The second line is satellite-based global quantum communication, taking advantage of the negligible photon loss and decoherence in the atmosphere. We realized teleportation and entanglement distribution over 100 km, and later on a rapidly moving platform. We are also making efforts toward the generation of multiphoton entanglement and its use in teleportation of multiple properties of a single quantum particle, topological error correction, quantum algorithms for solving systems of linear equations and machine learning. Finally, I will talk about our recent experiments on quantum simulations on ultracold atoms. On the one hand, by applying an optical Raman lattice technique, we realized a two-dimensional spin-obit (SO

  4. Scalable Data Mining and Archiving for the Square Kilometre Array (United States)

    Jones, D. L.; Mattmann, C. A.; Hart, A. F.; Lazio, J.; Bennett, T.; Wagstaff, K. L.; Thompson, D. R.; Preston, R.


    As the technologies for remote observation improve, the rapid increase in the frequency and fidelity of those observations translates into an avalanche of data that is already beginning to eclipse the resources, both human and technical, of the institutions and facilities charged with managing the information. Common data management tasks like cataloging both data itself and contextual meta-data, creating and maintaining scalable permanent archive, and making data available on-demand for research present significant software engineering challenges when considered at the scales of modern multi-national scientific enterprises such as the upcoming Square Kilometre Array project. The NASA Jet Propulsion Laboratory (JPL), leveraging internal research and technology development funding, has begun to explore ways to address the data archiving and distribution challenges with a number of parallel activities involving collaborations with the EVLA and ALMA teams at the National Radio Astronomy Observatory (NRAO), and members of the Square Kilometre Array South Africa team. To date, we have leveraged the Apache OODT Process Control System framework and its catalog and archive service components that provide file management, workflow management, resource management as core web services. A client crawler framework ingests upstream data (e.g., EVLA raw directory output), identifies its MIME type and automatically extracts relevant metadata including temporal bounds, and job-relevant/processing information. A remote content acquisition (pushpull) service is responsible for staging remote content and handing it off to the crawler framework. A science algorithm wrapper (called CAS-PGE) wraps underlying code including CASApy programs for the EVLA, such as Continuum Imaging and Spectral Line Cube generation, executes the algorithm, and ingests its output (along with relevant extracted metadata). In addition to processing, the Process Control System has been leveraged to provide data

  5. Parallel computing solution of Boltzmann neutron transport equation

    International Nuclear Information System (INIS)

    Ansah-Narh, T.


    The focus of the research was on developing parallel computing algorithm for solving Eigen-values of the Boltzmam Neutron Transport Equation (BNTE) in a slab geometry using multi-grid approach. In response to the problem of slow execution of serial computing when solving large problems, such as BNTE, the study was focused on the design of parallel computing systems which was an evolution of serial computing that used multiple processing elements simultaneously to solve complex physical and mathematical problems. Finite element method (FEM) was used for the spatial discretization scheme, while angular discretization was accomplished by expanding the angular dependence in terms of Legendre polynomials. The eigenvalues representing the multiplication factors in the BNTE were determined by the power method. MATLAB Compiler Version 4.1 (R2009a) was used to compile the MATLAB codes of BNTE. The implemented parallel algorithms were enabled with matlabpool, a Parallel Computing Toolbox function. The option UseParallel was set to 'always' and the default value of the option was 'never'. When those conditions held, the solvers computed estimated gradients in parallel. The parallel computing system was used to handle all the bottlenecks in the matrix generated from the finite element scheme and each domain of the power method generated. The parallel algorithm was implemented on a Symmetric Multi Processor (SMP) cluster machine, which had Intel 32 bit quad-core x 86 processors. Convergence rates and timings for the algorithm on the SMP cluster machine were obtained. Numerical experiments indicated the designed parallel algorithm could reach perfect speedup and had good stability and scalability. (au)

  6. Laplacian embedded regression for scalable manifold regularization. (United States)

    Chen, Lin; Tsang, Ivor W; Xu, Dong


    world data sets show the effectiveness and scalability of the proposed framework.

  7. Quantitative measurements in single-cell analysis: towards scalability in microbial bioprocess development. (United States)

    Demling, Philipp; Westerwalbesloh, Christoph; Noack, Stephan; Wiechert, Wolfgang; Kohlheyer, Dietrich


    Single-cell analysis in microfluidic cultivation devices bears a great potential for the development and optimization of industrial bioprocesses. High parallelization allows running a large number of cultivation experiments simultaneously even under quick alteration of environmental conditions. For example, the impact of changes in media composition on cell growth during classical batch cultivation can be easily resolved. A missing link for the scalability of microfluidic experiments is, however, their complete characterization via conventional performance indicators such as product titer and productivity. While existing mass spectrometry technology is not yet sufficiently coupled with microfluidics, optical methods like enzymatic assays or fluorescence sensors are promising alternatives but require further improvement to generate quantitative measurements of extracellular metabolites. Copyright © 2018 Elsevier Ltd. All rights reserved.

  8. Scalable designs for quasiparticle-poisoning-protected topological quantum computation with Majorana zero modes (United States)

    Karzig, Torsten; Knapp, Christina; Lutchyn, Roman M.; Bonderson, Parsa; Hastings, Matthew B.; Nayak, Chetan; Alicea, Jason; Flensberg, Karsten; Plugge, Stephan; Oreg, Yuval; Marcus, Charles M.; Freedman, Michael H.


    We present designs for scalable quantum computers composed of qubits encoded in aggregates of four or more Majorana zero modes, realized at the ends of topological superconducting wire segments that are assembled into superconducting islands with significant charging energy. Quantum information can be manipulated according to a measurement-only protocol, which is facilitated by tunable couplings between Majorana zero modes and nearby semiconductor quantum dots. Our proposed architecture designs have the following principal virtues: (1) the magnetic field can be aligned in the direction of all of the topological superconducting wires since they are all parallel; (2) topological T junctions are not used, obviating possible difficulties in their fabrication and utilization; (3) quasiparticle poisoning is abated by the charging energy; (4) Clifford operations are executed by a relatively standard measurement: detection of corrections to quantum dot energy, charge, or differential capacitance induced by quantum fluctuations; (5) it is compatible with strategies for producing good approximate magic states.

  9. SparkBLAST: scalable BLAST processing using in-memory operations. (United States)

    de Castro, Marcelo Rodrigo; Tostes, Catherine Dos Santos; Dávila, Alberto M R; Senger, Hermes; da Silva, Fabricio A B


    The demand for processing ever increasing amounts of genomic data has raised new challenges for the implementation of highly scalable and efficient computational systems. In this paper we propose SparkBLAST, a parallelization of a sequence alignment application (BLAST) that employs cloud computing for the provisioning of computational resources and Apache Spark as the coordination framework. As a proof of concept, some radionuclide-resistant bacterial genomes were selected for similarity analysis. Experiments in Google and Microsoft Azure clouds demonstrated that SparkBLAST outperforms an equivalent system implemented on Hadoop in terms of speedup and execution times. The superior performance of SparkBLAST is mainly due to the in-memory operations available through the Spark framework, consequently reducing the number of local I/O operations required for distributed BLAST processing.

  10. Investigating the Role of Biogeochemical Processes in the Northern High Latitudes on Global Climate Feedbacks Using an Efficient Scalable Earth System Model

    Energy Technology Data Exchange (ETDEWEB)

    Jain, Atul K. [Univ. of Illinois, Urbana-Champaign, IL (United States)


    The overall objectives of this DOE funded project is to combine scientific and computational challenges in climate modeling by expanding our understanding of the biogeophysical-biogeochemical processes and their interactions in the northern high latitudes (NHLs) using an earth system modeling (ESM) approach, and by adopting an adaptive parallel runtime system in an ESM to achieve efficient and scalable climate simulations through improved load balancing algorithms.

  11. Fast and Scalable Computation of the Forward and Inverse Discrete Periodic Radon Transform. (United States)

    Carranza, Cesar; Llamocca, Daniel; Pattichis, Marios


    The discrete periodic radon transform (DPRT) has extensively been used in applications that involve image reconstructions from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions that avoids the use of floating-point arithmetic associated with the use of the fast Fourier transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and the need for a large number of memory accesses. This paper introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: a parallel array of fixed-point adder trees; circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees; an image block-based approach to DPRT computation that can fit the proposed architecture to available resources; and fast transpositions that are computed in one or a few clock cycles that do not depend on the size of the input image. As a result, for an N × N image (N prime), the proposed approach can compute up to N(2) additions per clock cycle. Compared with the previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For example, for a 251×251 image, for approximately 25% fewer flip-flops than required for a systolic implementation, we have that the scalable DPRT is computed 36 times faster. For the fastest case, we introduce optimized just 2N + ⌈log(2) N⌉ + 1 and 2N + 3 ⌈log(2) N⌉ + B + 2 cycles, architectures that can compute the DPRT and its inverse in respectively, where B is the number of bits used to represent each input pixel. On the other hand, the scalable DPRT approach requires more 1-b additions than for the systolic implementation and provides a tradeoff between speed and additional 1-b additions. All of the proposed DPRT architectures were implemented in VHSIC Hardware Description Language

  12. Scalable Track Detection in SAR CCD Images

    Energy Technology Data Exchange (ETDEWEB)

    Chow, James G [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Quach, Tu-Thach [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)


    Existing methods to detect vehicle tracks in coherent change detection images, a product of combining two synthetic aperture radar images ta ken at different times of the same scene, rely on simple, fast models to label track pixels. These models, however, are often too simple to capture natural track features such as continuity and parallelism. We present a simple convolutional network architecture consisting of a series of 3-by-3 convolutions to detect tracks. The network is trained end-to-end to learn natural track features entirely from data. The network is computationally efficient and improves the F-score on a standard dataset to 0.988, up fr om 0.907 obtained by the current state-of-the-art method.

  13. A novel two-level dynamic parallel data scheme for large 3-D SN calculations

    International Nuclear Information System (INIS)

    Sjoden, G.E.; Shedlock, D.; Haghighat, A.; Yi, C.


    We introduce a new dynamic parallel memory optimization scheme for executing large scale 3-D discrete ordinates (Sn) simulations on distributed memory parallel computers. In order for parallel transport codes to be truly scalable, they must use parallel data storage, where only the variables that are locally computed are locally stored. Even with parallel data storage for the angular variables, cumulative storage requirements for large discrete ordinates calculations can be prohibitive. To address this problem, Memory Tuning has been implemented into the PENTRAN 3-D parallel discrete ordinates code as an optimized, two-level ('large' array, 'small' array) parallel data storage scheme. Memory Tuning can be described as the process of parallel data memory optimization. Memory Tuning dynamically minimizes the amount of required parallel data in allocated memory on each processor using a statistical sampling algorithm. This algorithm is based on the integral average and standard deviation of the number of fine meshes contained in each coarse mesh in the global problem. Because PENTRAN only stores the locally computed problem phase space, optimal two-level memory assignments can be unique on each node, depending upon the parallel decomposition used (hybrid combinations of angular, energy, or spatial). As demonstrated in the two large discrete ordinates models presented (a storage cask and an OECD MOX Benchmark), Memory Tuning can save a substantial amount of memory per parallel processor, allowing one to accomplish very large scale Sn computations. (authors)

  14. Parallel time integration software

    Energy Technology Data Exchange (ETDEWEB)


    This package implements an optimal-scaling multigrid solver for the (non) linear systems that arise from the discretization of problems with evolutionary behavior. Typically, solution algorithms for evolution equations are based on a time-marching approach, solving sequentially for one time step after the other. Parallelism in these traditional time-integrarion techniques is limited to spatial parallelism. However, current trends in computer architectures are leading twards system with more, but not faster. processors. Therefore, faster compute speeds must come from greater parallelism. One approach to achieve parallelism in time is with multigrid, but extending classical multigrid methods for elliptic poerators to this setting is a significant achievement. In this software, we implement a non-intrusive, optimal-scaling time-parallel method based on multigrid reduction techniques. The examples in the package demonstrate optimality of our multigrid-reduction-in-time algorithm (MGRIT) for solving a variety of parabolic equations in two and three sparial dimensions. These examples can also be used to show that MGRIT can achieve significant speedup in comparison to sequential time marching on modern architectures.

  15. Massively parallel multicanonical simulations (United States)

    Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard


    Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.

  16. Quality Scalability Compression on Single-Loop Solution in HEVC

    Directory of Open Access Journals (Sweden)

    Mengmeng Zhang


    Full Text Available This paper proposes a quality scalable extension design for the upcoming high efficiency video coding (HEVC standard. In the proposed design, the single-loop decoder solution is extended into the proposed scalable scenario. A novel interlayer intra/interprediction is added to reduce the amount of bits representation by exploiting the correlation between coding layers. The experimental results indicate that the average Bjøntegaard delta rate decrease of 20.50% can be gained compared with the simulcast encoding. The proposed technique achieved 47.98% Bjøntegaard delta rate reduction compared with the scalable video coding extension of the H.264/AVC. Consequently, significant rate savings confirm that the proposed method achieves better performance.

  17. Parallel programming with Python

    CERN Document Server

    Palach, Jan


    A fast, easy-to-follow and clear tutorial to help you develop Parallel computing systems using Python. Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts and will help you in implementing these techniques in the real world. If you are an experienced Python programmer and are willing to utilize the available computing resources by parallelizing applications in a simple way, then this book is for you. You are required to have a basic knowledge of Python development to get the most of this book.

  18. A Massively Parallel Sequence Similarity Search for Metagenomic Sequencing Data

    Directory of Open Access Journals (Sweden)

    Masanori Kakuta


    Full Text Available Sequence similarity searches have been widely used in the analyses of metagenomic sequencing data. Finding homologous sequences in a reference database enables the estimation of taxonomic and functional characteristics of each query sequence. Because current metagenomic sequencing data consist of a large number of nucleotide sequences, the time required for sequence similarity searches account for a large proportion of the total time. This time-consuming step makes it difficult to perform large-scale analyses. To analyze large-scale metagenomic data, such as those found in the human oral microbiome, we developed GHOST-MP (Genome-wide HOmology Search Tool on Massively Parallel system, a parallel sequence similarity search tool for massively parallel computing systems. This tool uses a fast search algorithm based on suffix arrays of query and database sequences and a hierarchical parallel search to accelerate the large-scale sequence similarity search of metagenomic sequencing data. The parallel computing efficiency and the search speed of this tool were evaluated. GHOST-MP was shown to be scalable over 10,000 CPU (Central Processing Unit cores, and achieved over 80-fold acceleration compared with mpiBLAST using the same computational resources. We applied this tool to human oral metagenomic data, and the results indicate that the oral cavity, the oral vestibule, and plaque have different characteristics based on the functional gene category.

  19. Novel Parallel Numerical Methods for Radiation and Neutron Transport

    International Nuclear Information System (INIS)

    Brown, P N


    In many of the multiphysics simulations performed at LLNL, transport calculations can take up 30 to 50% of the total run time. If Monte Carlo methods are used, the percentage can be as high as 80%. Thus, a significant core competence in the formulation, software implementation, and solution of the numerical problems arising in transport modeling is essential to Laboratory and DOE research. In this project, we worked on developing scalable solution methods for the equations that model the transport of photons and neutrons through materials. Our goal was to reduce the transport solve time in these simulations by means of more advanced numerical methods and their parallel implementations. These methods must be scalable, that is, the time to solution must remain constant as the problem size grows and additional computer resources are used. For iterative methods, scalability requires that (1) the number of iterations to reach convergence is independent of problem size, and (2) that the computational cost grows linearly with problem size. We focused on deterministic approaches to transport, building on our earlier work in which we performed a new, detailed analysis of some existing transport methods and developed new approaches. The Boltzmann equation (the underlying equation to be solved) and various solution methods have been developed over many years. Consequently, many laboratory codes are based on these methods, which are in some cases decades old. For the transport of x-rays through partially ionized plasmas in local thermodynamic equilibrium, the transport equation is coupled to nonlinear diffusion equations for the electron and ion temperatures via the highly nonlinear Planck function. We investigated the suitability of traditional-solution approaches to transport on terascale architectures and also designed new scalable algorithms; in some cases, we investigated hybrid approaches that combined both

  20. Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm (United States)

    Mavriplis, Dimitri J.


    The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.

  1. Parallel Algorithms for Graph Optimization using Tree Decompositions

    Energy Technology Data Exchange (ETDEWEB)

    Sullivan, Blair D [ORNL; Weerapurage, Dinesh P [ORNL; Groer, Christopher S [ORNL


    Although many $\\cal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.

  2. Micro parallel liquid chromatography: enabling technology for discovery analytical chemistry. (United States)

    Lemmo, Anthony V; Hobbs, Steve; Patel, Paren


    Since the introduction of combinatorial chemistry, compound libraries have undergone a significant increase in size and diversity. The ensuing expansion and diversification of compound libraries have resulted in increased demand for analytical throughput. Following the evolution of new technologies for generating lead compounds and targets and the desire to increase research and development productivity, analytical chemistry is now gaining attention as a bottleneck that would benefit from advances in instrumentation for increased analytical throughput. The commercial introduction of the Veloce trade mark micro parallel liquid chromatography system from Nanostream offers discovery analytical chemists the capability to analyze 24 samples in parallel with as little as 0.5 microl of sample. The system offers a scalable analytical approach to address bottlenecks in historically underserved areas, such as compound library purity screening, as well as higher value-added applications, such as log P determination and aqueous solubility assessment. This article describes the Veloce system and presents representative data from several discovery analytical applications.

  3. A Scalable Smart Meter Data Generator Using Spark

    DEFF Research Database (Denmark)

    Iftikhar, Nadeem; Liu, Xiufeng; Danalachi, Sergiu


    Today, smart meters are being used worldwide. As a matter of fact smart meters produce large volumes of data. Thus, it is important for smart meter data management and analytics systems to process petabytes of data. Benchmarking and testing of these systems require scalable data, however, it can...... be challenging to get large data sets due to privacy and/or data protection regulations. This paper presents a scalable smart meter data generator using Spark that can generate realistic data sets. The proposed data generator is based on a supervised machine learning method that can generate data of any size...

  4. Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

    Energy Technology Data Exchange (ETDEWEB)

    Moody, A T; Bronevetsky, G; Mohror, K M; de Supinski, B R


    High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. A potential solution to this problem is to use multi-level checkpointing, which employs multiple types of checkpoints with different costs and different levels of resiliency in a single run. The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures. While this approach is theoretically promising, it has not been fully evaluated in a large-scale, production system context. To this end we have designed a system, called the Scalable Checkpoint/Restart (SCR) library, that writes checkpoints to storage on the compute nodes utilizing RAM, Flash, or disk, in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the the load on the parallel file system by a factor of two on current and future systems.

  5. Practical parallel programming

    CERN Document Server

    Bauer, Barr E


    This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.

  6. Parallel Fast Legendre Transform

    NARCIS (Netherlands)

    Alves de Inda, M.; Bisseling, R.H.; Maslen, D.K.


    We discuss a parallel implementation of a fast algorithm for the discrete polynomial Legendre transform We give an introduction to the DriscollHealy algorithm using polynomial arithmetic and present experimental results on the eciency and accuracy of our implementation The algorithms were

  7. Parallel k-means++

    Energy Technology Data Exchange (ETDEWEB)


    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.

  8. Parallel universes beguile science

    CERN Multimedia


    A staple of mind-bending science fiction, the possibility of multiple universes has long intrigued hard-nosed physicists, mathematicians and cosmologists too. We may not be able -- as least not yet -- to prove they exist, many serious scientists say, but there are plenty of reasons to think that parallel dimensions are more than figments of eggheaded imagination.

  9. Expressing Parallelism with ROOT (United States)

    Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.


    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  10. Massively parallel signature sequencing. (United States)

    Zhou, Daixing; Rao, Mahendra S; Walker, Roger; Khrebtukova, Irina; Haudenschild, Christian D; Miura, Takumi; Decola, Shannon; Vermaas, Eric; Moon, Keith; Vasicek, Thomas J


    Massively parallel signature sequencing is an ultra-high throughput sequencing technology. It can simultaneously sequence millions of sequence tags, and, therefore, is ideal for whole genome analysis. When applied to expression profiling, it reveals almost every transcript in the sample and provides its accurate expression level. This chapter describes the technology and its application in establishing stem cell transcriptome databases.

  11. Parallel hierarchical global illumination

    Energy Technology Data Exchange (ETDEWEB)

    Snell, Quinn O. [Iowa State Univ., Ames, IA (United States)


    Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.

  12. Expressing Parallelism with ROOT

    Energy Technology Data Exchange (ETDEWEB)

    Piparo, D. [CERN; Tejedor, E. [CERN; Guiraud, E. [CERN; Ganis, G. [CERN; Mato, P. [CERN; Moneta, L. [CERN; Valls Pla, X. [CERN; Canal, P. [Fermilab


    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  13. Parallel Splash Belief Propagation (United States)


    Service, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of...heaps: An alternative to Fibonacci heaps with applications to parallel computation. Communications of the ACM, 31:1343–1354, 1988. G. Elidan, I. Mcgraw

  14. Efficient high-precision matrix algebra on parallel architectures for nonlinear combinatorial optimization

    KAUST Repository

    Gunnels, John


    We provide a first demonstration of the idea that matrix-based algorithms for nonlinear combinatorial optimization problems can be efficiently implemented. Such algorithms were mainly conceived by theoretical computer scientists for proving efficiency. We are able to demonstrate the practicality of our approach by developing an implementation on a massively parallel architecture, and exploiting scalable and efficient parallel implementations of algorithms for ultra high-precision linear algebra. Additionally, we have delineated and implemented the necessary algorithmic and coding changes required in order to address problems several orders of magnitude larger, dealing with the limits of scalability from memory footprint, computational efficiency, reliability, and interconnect perspectives. © Springer and Mathematical Programming Society 2010.

  15. Proxy-equation paradigm: A strategy for massively parallel asynchronous computations (United States)

    Mittal, Ankita; Girimaji, Sharath


    Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus, (iii) expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.

  16. Weighted Local Active Pixel Pattern (WLAPP for Face Recognition in Parallel Computation Environment

    Directory of Open Access Journals (Sweden)

    Gundavarapu Mallikarjuna Rao


    Full Text Available Abstract  - The availability of multi-core technology resulted totally new computational era. Researchers are keen to explore available potential in state of art-machines for breaking the bearer imposed by serial computation. Face Recognition is one of the challenging applications on so ever computational environment. The main difficulty of traditional Face Recognition algorithms is lack of the scalability. In this paper Weighted Local Active Pixel Pattern (WLAPP, a new scalable Face Recognition Algorithm suitable for parallel environment is proposed.  Local Active Pixel Pattern (LAPP is found to be simple and computational inexpensive compare to Local Binary Patterns (LBP. WLAPP is developed based on concept of LAPP. The experimentation is performed on FG-Net Aging Database with deliberately introduced 20% distortion and the results are encouraging. Keywords — Active pixels, Face Recognition, Local Binary Pattern (LBP, Local Active Pixel Pattern (LAPP, Pattern computing, parallel workers, template, weight computation.  

  17. PVeStA: A Parallel Statistical Model Checking and Quantitative Analysis Tool

    KAUST Repository

    AlTurki, Musab


    Statistical model checking is an attractive formal analysis method for probabilistic systems such as, for example, cyber-physical systems which are often probabilistic in nature. This paper is about drastically increasing the scalability of statistical model checking, and making such scalability of analysis available to tools like Maude, where probabilistic systems can be specified at a high level as probabilistic rewrite theories. It presents PVeStA, an extension and parallelization of the VeStA statistical model checking tool [10]. PVeStA supports statistical model checking of probabilistic real-time systems specified as either: (i) discrete or continuous Markov Chains; or (ii) probabilistic rewrite theories in Maude. Furthermore, the properties that it can model check can be expressed in either: (i) PCTL/CSL, or (ii) the QuaTEx quantitative temporal logic. As our experiments show, the performance gains obtained from parallelization can be very high. © 2011 Springer-Verlag.

  18. Distributed and cloud computing from parallel processing to the Internet of Things

    CERN Document Server

    Hwang, Kai; Fox, Geoffrey C


    Distributed and Cloud Computing, named a 2012 Outstanding Academic Title by the American Library Association's Choice publication, explains how to create high-performance, scalable, reliable systems, exposing the design principles, architecture, and innovative applications of parallel, distributed, and cloud computing systems. Starting with an overview of modern distributed models, the book provides comprehensive coverage of distributed and cloud computing, including: Facilitating management, debugging, migration, and disaster recovery through virtualization Clustered systems for resear

  19. Scalable-to-lossless transform domain distributed video coding

    DEFF Research Database (Denmark)

    Huang, Xin; Ukhanova, Ann; Veselov, Anton


    Distributed video coding (DVC) is a novel approach providing new features as low complexity encoding by mainly exploiting the source statistics at the decoder based on the availability of decoder side information. In this paper, scalable-tolossless DVC is presented based on extending a lossy Tran...

  20. Scalable storage for a DBMS using transparent distribution

    NARCIS (Netherlands)

    J.S. Karlsson; M.L. Kersten (Martin)


    textabstractScalable Distributed Data Structures (SDDSs) provide a self-managing and self-organizing data storage of potentially unbounded size. This stands in contrast to common distribution schemas deployed in conventional distributed DBMS. SDDSs, however, have mostly been used in synthetic

  1. MASCOT : metadata for advanced scalable video coding tools : final report

    NARCIS (Netherlands)

    H.J.A.M. Heijmans (Henk); C. Bernard; M. Domanski; B. Pesquet-Popescu; A. Smolic; P. Schelkens; L. Torres


    textabstractThe goal of the MASCOT project was to develop new video coding schemes and tools that provide both an increased coding efficiency as well as extended scalability features compared to technology that was available at the beginning of the project. Towards that goal the following tools

  2. Computational Hydrodynamics: How Portable and Scalable Are Heterogeneous Programming Paradigms?

    DEFF Research Database (Denmark)

    Pawlak, Wojciech; Glimberg, Stefan Lemvig; Engsig-Karup, Allan Peter

    simulations in arbitrary size Numerical Wave Tanks. The application has already been studied in a series of works (see References) and is demonstrated to exhibit excellent performance portability and scalability using hybrid MPI-OpenCL/CUDA. Furthermore, it can be executed on arbitrary heterogeneous multi...

  3. Cascaded column generation for scalable predictive demand side management

    NARCIS (Netherlands)

    Toersche, Hermen; Molderink, Albert; Hurink, Johann L.; Smit, Gerardus Johannes Maria


    We propose a nested Dantzig-Wolfe decomposition, combined with dynamic programming, for the distributed scheduling of a large heterogeneous fleet of residential appliances with nonlinear behavior. A cascaded column generation approach gives a scalable optimization strategy, provided that the problem

  4. Adolescent sexuality education: An appraisal of some scalable ...

    African Journals Online (AJOL)

    Adolescent sexuality education: An appraisal of some scalable interventions for the Nigerian context. VC Pam. Abstract. Most issues around sexual intercourse are highly sensitive topics in Nigeria. Despite the disturbingly high adolescent HIV prevalence and teenage pregnancy rate in Nigeria, sexuality education is ...

  5. A Framework towards Educational Scalability of Open Online Courses

    NARCIS (Netherlands)

    Kasch, Julia; Van Rosmalen, Peter; Kalz, Marco


    Although the terms scale and scalable are often used in the context of Open Online Education (OOE), there is no clear definition about these concepts from an educational perspective on the course level. This paper critically discusses the origins of these concepts and provides a working definition

  6. Efficient Enhancement for Spatial Scalable Video Coding Transmission

    Directory of Open Access Journals (Sweden)

    Mayada Khairy


    Full Text Available Scalable Video Coding (SVC is an international standard technique for video compression. It is an extension of H.264 Advanced Video Coding (AVC. In the encoding of video streams by SVC, it is suitable to employ the macroblock (MB mode because it affords superior coding efficiency. However, the exhaustive mode decision technique that is usually used for SVC increases the computational complexity, resulting in a longer encoding time (ET. Many other algorithms were proposed to solve this problem with imperfection of increasing transmission time (TT across the network. To minimize the ET and TT, this paper introduces four efficient algorithms based on spatial scalability. The algorithms utilize the mode-distribution correlation between the base layer (BL and enhancement layers (ELs and interpolation between the EL frames. The proposed algorithms are of two categories. Those of the first category are based on interlayer residual SVC spatial scalability. They employ two methods, namely, interlayer interpolation (ILIP and the interlayer base mode (ILBM method, and enable ET and TT savings of up to 69.3% and 83.6%, respectively. The algorithms of the second category are based on full-search SVC spatial scalability. They utilize two methods, namely, full interpolation (FIP and the full-base mode (FBM method, and enable ET and TT savings of up to 55.3% and 76.6%, respectively.

  7. Scalable Nonlinear Solvers for Fully Implicit Coupled Nuclear Fuel Modeling. Final Report

    International Nuclear Information System (INIS)

    Cai, Xiao-Chuan; Yang, Chao; Pernice, Michael


    The focus of the project is on the development and customization of some highly scalable domain decomposition based preconditioning techniques for the numerical solution of nonlinear, coupled systems of partial differential equations (PDEs) arising from nuclear fuel simulations. These high-order PDEs represent multiple interacting physical fields (for example, heat conduction, oxygen transport, solid deformation), each is modeled by a certain type of Cahn-Hilliard and/or Allen-Cahn equations. Most existing approaches involve a careful splitting of the fields and the use of field-by-field iterations to obtain a solution of the coupled problem. Such approaches have many advantages such as ease of implementation since only single field solvers are needed, but also exhibit disadvantages. For example, certain nonlinear interactions between the fields may not be fully captured, and for unsteady problems, stable time integration schemes are difficult to design. In addition, when implemented on large scale parallel computers, the sequential nature of the field-by-field iterations substantially reduces the parallel efficiency. To overcome the disadvantages, fully coupled approaches have been investigated in order to obtain full physics simulations.

  8. A compact linear accelerator based on a scalable microelectromechanical-system RF-structure (United States)

    Persaud, A.; Ji, Q.; Feinberg, E.; Seidl, P. A.; Waldron, W. L.; Schenkel, T.; Lal, A.; Vinayakumar, K. B.; Ardanuc, S.; Hammer, D. A.


    A new approach for a compact radio-frequency (RF) accelerator structure is presented. The new accelerator architecture is based on the Multiple Electrostatic Quadrupole Array Linear Accelerator (MEQALAC) structure that was first developed in the 1980s. The MEQALAC utilized RF resonators producing the accelerating fields and providing for higher beam currents through parallel beamlets focused using arrays of electrostatic quadrupoles (ESQs). While the early work obtained ESQs with lateral dimensions on the order of a few centimeters, using a printed circuit board (PCB), we reduce the characteristic dimension to the millimeter regime, while massively scaling up the potential number of parallel beamlets. Using Microelectromechanical systems scalable fabrication approaches, we are working on further reducing the characteristic dimension to the sub-millimeter regime. The technology is based on RF-acceleration components and ESQs implemented in the PCB or silicon wafers where each beamlet passes through beam apertures in the wafer. The complete accelerator is then assembled by stacking these wafers. This approach has the potential for fast and inexpensive batch fabrication of the components and flexibility in system design for application specific beam energies and currents. For prototyping the accelerator architecture, the components have been fabricated using the PCB. In this paper, we present proof of concept results of the principal components using the PCB: RF acceleration and ESQ focusing. Ongoing developments on implementing components in silicon and scaling of the accelerator technology to high currents and beam energies are discussed.

  9. Neighborhood communication paradigm to increase scalability in large-scale dynamic scientific applications

    KAUST Repository

    Ovcharenko, Aleksandr


    This paper introduces a general-purpose communication package built on top of MPI which is aimed at improving inter-processor communications independently of the supercomputer architecture being considered. The package is developed to support parallel applications that rely on computation characterized by large number of messages of various sizes, often small, that are focused within processor neighborhoods. In some cases, such as solvers having static mesh partitions, the number and size of messages are known a priori. However, in other cases such as mesh adaptation, the messages evolve and vary in number and size and include the dynamic movement of partition objects. The current package provides a utility for dynamic applications based on two key attributes that are: (i) explicit consideration of the neighborhood communication pattern to avoid many-to-many calls and also to reduce the number of collective calls to a minimum, and (ii) use of non-blocking MPI functions along with message packing to manage message flow control and reduce the number and time of communication calls. The test application demonstrated is parallel unstructured mesh adaptation. Results on IBM Blue Gene/P and Cray XE6 computers show that the use of neighborhood-based communication control leads to scalable results when executing generally imbalanced mesh adaptation runs. © 2011 Elsevier B.V. All rights reserved.

  10. PyClaw: Accessible, Extensible, Scalable Tools for Wave Propagation Problems

    KAUST Repository

    Ketcheson, David I.


    Development of scientific software involves tradeoffs between ease of use, generality, and performance. We describe the design of a general hyperbolic PDE solver that can be operated with the convenience of MATLAB yet achieves efficiency near that of hand-coded Fortran and scales to the largest supercomputers. This is achieved by using Python for most of the code while employing automatically wrapped Fortran kernels for computationally intensive routines, and using Python bindings to interface with a parallel computing library and other numerical packages. The software described here is PyClaw, a Python-based structured grid solver for general systems of hyperbolic PDEs [K. T. Mandli et al., PyClaw Software, Version 1.0, (2011)]. PyClaw provides a powerful and intuitive interface to the algorithms of the existing Fortran codes Clawpack and SharpClaw, simplifying code development and use while providing massive parallelism and scalable solvers via the PETSc library. The package is further augmented by use of PyWENO for generation of efficient high-order weighted essentially nonoscillatory reconstruction code. The simplicity, capability, and performance of this approach are demonstrated through application to example problems in shallow water flow, compressible flow, and elasticity.

  11. Towards Scalable Deep Learning via I/O Analysis and Optimization

    Energy Technology Data Exchange (ETDEWEB)

    Pumma, Sarunya; Si, Min; Feng, Wu-Chun; Balaji, Pavan


    Deep learning systems have been growing in prominence as a way to automatically characterize objects, trends, and anomalies. Given the importance of deep learning systems, researchers have been investigating techniques to optimize such systems. An area of particular interest has been using large supercomputing systems to quickly generate effective deep learning networks: a phase often referred to as “training” of the deep learning neural network. As we scale existing deep learning frameworks—such as Caffe—on these large supercomputing systems, we notice that the parallelism can help improve the computation tremendously, leaving data I/O as the major bottleneck limiting the overall system scalability. In this paper, we first present a detailed analysis of the performance bottlenecks of Caffe on large supercomputing systems. Our analysis shows that the I/O subsystem of Caffe—LMDB—relies on memory-mapped I/O to access its database, which can be highly inefficient on large-scale systems because of its interaction with the process scheduling system and the network-based parallel filesystem. Based on this analysis, we then present LMDBIO, our optimized I/O plugin for Caffe that takes into account the data access pattern of Caffe in order to vastly improve I/O performance. Our experimental results show that LMDBIO can improve the overall execution time of Caffe by nearly 20-fold in some cases.

  12. StagBL : A Scalable, Portable, High-Performance Discretization and Solver Layer for Geodynamic Simulation (United States)

    Sanan, P.; Tackley, P. J.; Gerya, T.; Kaus, B. J. P.; May, D.


    StagBL is an open-source parallel solver and discretization library for geodynamic simulation,encapsulating and optimizing operations essential to staggered-grid finite volume Stokes flow solvers.It provides a parallel staggered-grid abstraction with a high-level interface in C and Fortran.On top of this abstraction, tools are available to define boundary conditions and interact with particle systems.Tools and examples to efficiently solve Stokes systems defined on the grid are provided in small (direct solver), medium (simple preconditioners), and large (block factorization and multigrid) model regimes.By working directly with leading application codes (StagYY, I3ELVIS, and LaMEM) and providing an API and examples to integrate with others, StagBL aims to become a community tool supplying scalable, portable, reproducible performance toward novel science in regional- and planet-scale geodynamics and planetary science.By implementing kernels used by many research groups beneath a uniform abstraction layer, the library will enable optimization for modern hardware, thus reducing community barriers to large- or extreme-scale parallel simulation on modern architectures. In particular, the library will include CPU-, Manycore-, and GPU-optimized variants of matrix-free operators and multigrid components.The common layer provides a framework upon which to introduce innovative new tools.StagBL will leverage p4est to provide distributed adaptive meshes, and incorporate a multigrid convergence analysis tool.These options, in addition to a wealth of solver options provided by an interface to PETSc, will make the most modern solution techniques available from a common interface. StagBL in turn provides a PETSc interface, DMStag, to its central staggered grid abstraction.We present public version 0.5 of StagBL, including preliminary integration with application codes and demonstrations with its own demonstration application, StagBLDemo. Central to StagBL is the notion of an

  13. Seismic waves modeling with the Fourier pseudo-spectral method on massively parallel machines. (United States)

    Klin, Peter


    The Fourier pseudo-spectral method (FPSM) is an approach for the 3D numerical modeling of the wave propagation, which is based on the discretization of the spatial domain in a structured grid and relies on global spatial differential operators for the solution of the wave equation. This last peculiarity is advantageous from the accuracy point of view but poses difficulties for an efficient implementation of the method to be run on parallel computers with distributed memory architecture. The 1D spatial domain decomposition approach has been so far commonly adopted in the parallel implementations of the FPSM, but it implies an intensive data exchange among all the processors involved in the computation, which can degrade the performance because of communication latencies. Moreover, the scalability of the 1D domain decomposition is limited, since the number of processors can not exceed the number of grid points along the directions in which the domain is partitioned. This limitation inhibits an efficient exploitation of the computational environments with a very large number of processors. In order to overcome the limitations of the 1D domain decomposition we implemented a parallel version of the FPSM based on a 2D domain decomposition, which allows to achieve a higher degree of parallelism and scalability on massively parallel machines with several thousands of processing elements. The parallel programming is essentially achieved using the MPI protocol but OpenMP parts are also included in order to exploit the single processor multi - threading capabilities, when available. The developed tool is aimed at the numerical simulation of the seismic waves propagation and in particular is intended for earthquake ground motion research. We show the scalability tests performed up to 16k processing elements on the IBM Blue Gene/Q computer at CINECA (Italy), as well as the application to the simulation of the earthquake ground motion in the alluvial plain of the Po river (Italy).

  14. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations (United States)

    Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. A.


    The latest release of NWChem delivers an open-source computational chemistry package with extensive capabilities for large scale simulations of chemical and biological systems. Utilizing a common computational framework, diverse theoretical descriptions can be used to provide the best solution for a given scientific problem. Scalable parallel implementations and modular software design enable efficient utilization of current computational architectures. This paper provides an overview of NWChem focusing primarily on the core theoretical modules provided by the code and their parallel performance. Program summaryProgram title: NWChem Catalogue identifier: AEGI_v1_0 Program summary URL: Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Open Source Educational Community License No. of lines in distributed program, including test data, etc.: 11 709 543 No. of bytes in distributed program, including test data, etc.: 680 696 106 Distribution format: tar.gz Programming language: Fortran 77, C Computer: all Linux based workstations and parallel supercomputers, Windows and Apple machines Operating system: Linux, OS X, Windows Has the code been vectorised or parallelized?: Code is parallelized Classification: 2.1, 2.2, 3, 7.3, 7.7, 16.1, 16.2, 16.3, 16.10, 16.13 Nature of problem: Large-scale atomistic simulations of chemical and biological systems require efficient and reliable methods for ground and excited solutions of many-electron Hamiltonian, analysis of the potential energy surface, and dynamics. Solution method: Ground and excited solutions of many-electron Hamiltonian are obtained utilizing density-functional theory, many-body perturbation approach, and coupled cluster expansion. These solutions or a combination thereof with classical descriptions are then used to analyze potential energy surface and perform dynamical simulations. Additional comments: Full

  15. Parallel grid population (United States)

    Wald, Ingo; Ize, Santiago


    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.

  16. Ultrascalable petaflop parallel supercomputer (United States)

    Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY


    A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.


    Directory of Open Access Journals (Sweden)

    Florian Ion Tiberius Petrescu


    Full Text Available Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 Moving mechanical systems parallel structures are solid, fast, and accurate. Between parallel systems it is to be noticed Stewart platforms, as the oldest systems, fast, solid and precise. The work outlines a few main elements of Stewart platforms. Begin with the geometry platform, kinematic elements of it, and presented then and a few items of dynamics. Dynamic primary element on it means the determination mechanism kinetic energy of the entire Stewart platforms. It is then in a record tail cinematic mobile by a method dot matrix of rotation. If a structural mottoelement consists of two moving elements which translates relative, drive train and especially dynamic it is more convenient to represent the mottoelement as a single moving components. We have thus seven moving parts (the six motoelements or feet to which is added mobile platform 7 and one fixed.

  18. A Parallel Supercomputer Implementation of a Biological Inspired Neural Network and its use for Pattern Recognition

    International Nuclear Information System (INIS)

    De Ladurantaye, Vincent; Lavoie, Jean; Bergeron, Jocelyn; Parenteau, Maxime; Lu Huizhong; Pichevar, Ramin; Rouat, Jean


    A parallel implementation of a large spiking neural network is proposed and evaluated. The neural network implements the binding by synchrony process using the Oscillatory Dynamic Link Matcher (ODLM). Scalability, speed and performance are compared for 2 implementations: Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA) running on clusters of multicore supercomputers and NVIDIA graphical processing units respectively. A global spiking list that represents at each instant the state of the neural network is described. This list indexes each neuron that fires during the current simulation time so that the influence of their spikes are simultaneously processed on all computing units. Our implementation shows a good scalability for very large networks. A complex and large spiking neural network has been implemented in parallel with success, thus paving the road towards real-life applications based on networks of spiking neurons. MPI offers a better scalability than CUDA, while the CUDA implementation on a GeForce GTX 285 gives the best cost to performance ratio. When running the neural network on the GTX 285, the processing speed is comparable to the MPI implementation on RQCHP's Mammouth parallel with 64 notes (128 cores).

  19. Performance modelling of parallel BLAST using Intel and PGI compilers on an infiniband-based HPC cluster. (United States)

    Al-Mulhem, Muhammed; Al-Shaikh, Raed


    The Basic Local Alignment Search (BLAST) is one of the most widely used bioinformatics programs for searching all available sequence databases for similarities between a protein or DNA query and predefined sequences, using sequence alignment technique. Recently, many attempts have been made to make the algorithm practical to run against the publicly available genome databases. This paper presents our experience in mapping and evaluating both the serial and parallel BLAST algorithms onto a large Infiniband-based High Performance Cluster. The evaluation is performed using two commonly used parallel compilers, Intel and Portland's PGI. The paper also presents the evaluation methodology along with the experimental results to illustrate the scalability of the BLAST algorithm on our state-of-the-art HPC system. Our results show that BLAST runtime scalability can be achieved with up to 87% efficiency when considering the right combination of the MPI suite, the parallel compiler, the cluster interconnect and the CPU technology.

  20. BLAST in Gid (BiG): A Grid-Enabled Software Architecture and Implementation of Parallel and Sequential BLAST

    International Nuclear Information System (INIS)

    Aparicio, G.; Blanquer, I.; Hernandez, V.; Segrelles, D.


    The integration of High-performance computing tools is a key issue in biomedical research. Many computer-based applications have been migrated to High-Performance computers to deal with their computing and storage needs such as BLAST. However, the use of clusters and computing farm presents problems in scalability. The use of a higher layer of parallelism that splits the task into highly independent long jobs that can be executed in parallel can improve the performance maintaining the efficiency. Grid technologies combined with parallel computing resources are an important enabling technology. This work presents a software architecture for executing BLAST in a International Grid Infrastructure that guarantees security, scalability and fault tolerance. The software architecture is modular an adaptable to many other high-throughput applications, both inside the field of bio computing and outside. (Author)

  1. Xyce parallel electronic simulator.

    Energy Technology Data Exchange (ETDEWEB)

    Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Rankin, Eric Lamont; Schiek, Richard Louis; Thornquist, Heidi K.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Santarelli, Keith R.


    This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide. The focus of this document is (to the extent possible) exhaustively list device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide.

  2. Algorithmically specialized parallel computers

    CERN Document Server

    Snyder, Lawrence; Gannon, Dennis B


    Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster

  3. Stability of parallel flows

    CERN Document Server

    Betchov, R


    Stability of Parallel Flows provides information pertinent to hydrodynamical stability. This book explores the stability problems that occur in various fields, including electronics, mechanics, oceanography, administration, economics, as well as naval and aeronautical engineering. Organized into two parts encompassing 10 chapters, this book starts with an overview of the general equations of a two-dimensional incompressible flow. This text then explores the stability of a laminar boundary layer and presents the equation of the inviscid approximation. Other chapters present the general equation

  4. Anti-parallel triplexes

    DEFF Research Database (Denmark)

    Kosbar, Tamer R.; Sofan, Mamdouh A.; Waly, Mohamed A.


    -parallel TFO strand was modified with Y with one or two insertions at the end of the TFO strand, the thermal stability was increased 1.2 °C and 3 °C at pH 7.2, respectively, whereas one insertion in the middle of the TFO strand decreased the thermal stability 1.4 °C compared to the wild type oligonucleotide......The phosphoramidites of DNA monomers of 7-(3-aminopropyn-1-yl)-8-aza-7-deazaadenine (Y) and 7-(3-aminopropyn-1-yl)-8-aza-7-deazaadenine LNA (Z) are synthesized, and the thermal stability at pH 7.2 and 8.2 of anti-parallel triplexes modified with these two monomers is determined. When, the anti...... chain, especially at the end of the TFO strand. On the other hand, the thermal stability of the anti-parallel triplex was dramatically decreased when the TFO strand was modified with the LNA monomer analog Z in the middle of the TFO strand (ΔTm = -9.1 °C). Also the thermal stability decreased...

  5. Extending the POSIX I/O interface: a parallel file system perspective.

    Energy Technology Data Exchange (ETDEWEB)

    Vilayannur, M.; Lang, S.; Ross, R.; Klundt, R.; Ward, L.; Mathematics and Computer Science; VMWare, Inc.; SNL


    The POSIX interface does not lend itself well to enabling good performance for high-end applications. Extensions are needed in the POSIX I/O interface so that high-concurrency HPC applications running on top of parallel file systems perform well. This paper presents the rationale, design, and evaluation of a reference implementation of a subset of the POSIX I/O interfaces on a widely used parallel file system (PVFS) on clusters. Experimental results on a set of micro-benchmarks confirm that the extensions to the POSIX interface greatly improve scalability and performance.

  6. Parallel processing architecture for H.264 deblocking filter on multi-core platforms (United States)

    Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao


    Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking

  7. Temporal scalability comparison of the H.264/SVC and distributed video codec

    DEFF Research Database (Denmark)

    Huang, Xin; Ukhanova, Ann; Belyaev, Evgeny


    The problem of the multimedia scalable video streaming is a current topic of interest. There exist many methods for scalable video coding. This paper is focused on the scalable extension of H.264/AVC (H.264/SVC) and distributed video coding (DVC). The paper presents an efficiency comparison of SVC...

  8. Towards Flexible, Reliable, High Throughput Parallel Discrete Event Simulations (United States)

    Fujimoto, Richard; Park, Alfred; Huang, Jen-Chih

    The excessive amount of time necessary to complete large-scale discrete-event simulations of complex systems such as telecommunication networks, transportation systems, and multiprocessor computers continues to plague researchers and impede progress in many important domains. Parallel discrete-event simulation techniques offer an attractive solution to addressing this problem by enabling scalable execution, and much prior research has been focused on this approach. However, effective, practical solutions have been elusive. This is largely because of the complexity and difficulty associated with developing parallel discrete event simulation systems and software. In particular, an effective parallel execution mechanism must simultaneously address a variety of issues, such as efficient synchronization, resource allocation, load distribution, and fault tolerance, to mention a few. Further, most systems that have been developed to date assume a set of processors is dedicated to completing the simulation computation, yielding inflexible execution environments. Effective exploitation of parallelization techniques requires that these issues be addressed automatically by the underlying system, largely transparent to the application developer.

  9. A Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Jack Dongarra; Shirley Moore; Bart Miller, Jeffrey Hollingsworth; Tracy Rafferty


    The purpose of this project was to build an extensible cross-platform infrastructure to facilitate the development of accurate and portable performance analysis tools for current and future high performance computing (HPC) architectures. Major accomplishments include tools and techniques for multidimensional performance analysis, as well as improved support for dynamic performance monitoring of multithreaded and multiprocess applications. Previous performance tool development has been limited by the burden of having to re-write a platform-dependent low-level substrate for each architecture/operating system pair in order to obtain the necessary performance data from the system. Manual interpretation of performance data is not scalable for large-scale long-running applications. The infrastructure developed by this project provides a foundation for building portable and scalable performance analysis tools, with the end goal being to provide application developers with the information they need to analyze, understand, and tune the performance of terascale applications on HPC architectures. The backend portion of the infrastructure provides runtime instrumentation capability and access to hardware performance counters, with thread-safety for shared memory environments and a communication substrate to support instrumentation of multiprocess and distributed programs. Front end interfaces provides tool developers with a well-defined, platform-independent set of calls for requesting performance data. End-user tools have been developed that demonstrate runtime data collection, on-line and off-line analysis of performance data, and multidimensional performance analysis. The infrastructure is based on two underlying performance instrumentation technologies. These technologies are the PAPI cross-platform library interface to hardware performance counters and the cross-platform Dyninst library interface for runtime modification of executable images. The Paradyn and KOJAK

  10. BISICLES - A Scalable Finite-Volume Adaptive Mesh Refinement Ice Sheet Model (United States)

    Martin, D. F.; Cornford, S. L.; Ranken, D. F.; Le Brocq, A. M.; Gladstone, R. M.; Payne, A. J.; Ng, E. G.; Lipscomb, W. H.


    Understanding the changing behavior of land ice sheets is essential for accurate projection of sea-level change. The dynamics of ice sheets span a wide range of scales. Localized regions such as grounding lines and ice streams require extremely fine (better than 1 km) resolution to correctly capture the dynamics. Resolving such features using a uniform computational mesh would be prohibitively expensive. Conversely, there are large regions where such fine resolution is unnecessary and represents a waste of computational resources. This makes ice sheets a prime candidate for adaptive mesh refinement (AMR), in which finer spatial resolution is added only where needed, enabling the efficient use of computing resources. The Berkeley ISICLES (BISICLES) project is a collaboration among the Lawrence Berkeley and Los Alamos National Laboratories in the U.S. and the University of Bristol in the U.K. We are constructing a high-performance scalable AMR ice sheet model using the Chombo parallel AMR framework. The placement of refined meshes can easily adapt dynamically to follow the changing and evolving features of the ice sheets. We also use a vertically-integrated treatment of the momentum equation based on that of Schoof and Hindmarsh (2010), which permits additional computational efficiency. Using Chombo enables us to take advantage of existing scalable multigrid-based AMR elliptic solvers and PPM-based AMR hyperbolic solvers. Linking to the existing Glimmer-CISM community ice sheet model as an alternative dynamical core allows use of many features of the existing Glimmer-CISM model, including a coupler to CESM. We present results showing the effectiveness of our approach, both for simple benchmark problems which validate our approach, and for application to regional and continental-scale ice-sheet modeling.

  11. Time-dependent density-functional theory in massively parallel computer architectures: the OCTOPUS project. (United States)

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L


    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  12. Time-dependent density-functional theory in massively parallel computer architectures: the octopus project (United States)

    Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.


    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.

  13. Time-dependent density-functional theory in massively parallel computer architectures: the octopus project

    International Nuclear Information System (INIS)

    Andrade, Xavier; Aspuru-Guzik, Alán; Alberdi-Rodriguez, Joseba; Rubio, Angel; Strubbe, David A; Louie, Steven G; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Marques, Miguel A L


    Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures. (topical review)

  14. Parallel algorithm of trigonometric collocation method in nonlinear dynamics of rotors

    Directory of Open Access Journals (Sweden)

    Musil T.


    Full Text Available A parallel algorithm of a numeric procedure based on a method of trigonometric collocation is presented for investigating an unbalance response of a rotor supported by journal bearings. After a condensation process the trigonometric collocation method results in a set of nonlinear algebraic equations which is solved by the Newton-Raphson method. The order of the set is proportional to the number of nonlinear bearing coordinates and terms of the finite Fourier series. The algorithm, realized in the MATLAB parallel computing environment (DCT/DCE, uses message passing technique for interacting among processes on nodes of a parallel computer. This technique enables portability of the source code both on parallel computers with distributed and shared memory. Tests, made on a Beowulf cluster and a symmetric multiprocessor, have revealed very good speed-up and scalability of this algorithm.

  15. Design of a dataway processor for a parallel image signal processing system (United States)

    Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu


    Recently, demands for high-speed signal processing have been increasing especially in the field of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in a full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top- down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometers CMOS technology, and its hardware is about 200 K gates.

  16. Parallel framework for topology optimization using the method of moving asymptotes

    DEFF Research Database (Denmark)

    Aage, Niels; Lazarov, Boyan Stefanov


    and simple to implement linear solvers and optimization algorithms. However, to ensure generality, the code is developed to be easily extendable in terms of physical models as well as in terms of solution methods, without compromising the parallel scalability. The widely used Method of Moving Asymptotes...... optimization algorithm is parallelized and included as a fundamental part of the code. The capabilities of the presented approaches are demonstrated on topology optimization of a Stokes flow problem with target outflow constraints as well as the minimum compliance problem with a volume constraint from linear...... problems. The only feasible way to obtain a solution within a reasonable amount of time is to use parallel computations in order to speed up the solution process. The focus of this article is on a fully parallel topology optimization framework implemented in C++, with emphasis on utilizing well tested...

  17. Resistor Combinations for Parallel Circuits. (United States)

    McTernan, James P.


    To help simplify both teaching and learning of parallel circuits, a high school electricity/electronics teacher presents and illustrates the use of tables of values for parallel resistive circuits in which total resistances are whole numbers. (MF)


    Directory of Open Access Journals (Sweden)

    M. K. Bouza


    Full Text Available The object of research is the tools to support the development of parallel programs in C/C ++. The methods and software which automates the process of designing parallel applications are proposed.

  19. Very Large Parallel Data Flow (United States)


    billion characters. Teradata Corporation’s DBC/1012, a parallel relational database machine, is another high performance engine for large database...the volume of data being processed. Such parallelism is currently exploited in multiproces- sor relational database machines such as the Teradata DBC...scale parallelism can be achieved in all-solutions relations using or-parallelism in a multiprocessor architecture. For instance, the Teradata database

  20. Parallel External Memory Graph Algorithms

    DEFF Research Database (Denmark)

    Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari


    In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one o f the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest...... an optimal speedup of ¿(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts....

  1. Parallel Architectures and Bioinspired Algorithms

    CERN Document Server

    Pérez, José; Lanchares, Juan


    This monograph presents examples of best practices when combining bioinspired algorithms with parallel architectures. The book includes recent work by leading researchers in the field and offers a map with the main paths already explored and new ways towards the future. Parallel Architectures and Bioinspired Algorithms will be of value to both specialists in Bioinspired Algorithms, Parallel and Distributed Computing, as well as computer science students trying to understand the present and the future of Parallel Architectures and Bioinspired Algorithms.

  2. Integrating Task and Data Parallelism


    Massingill, Berna


    Many models of concurrency and concurrent programming have been proposed; most can be categorized as either task-parallel (based on functional decomposition) or data-parallel (based on data decomposition). Task-parallel models are most effective for expressing irregular computations; data-parallel models are most effective for expressing regular computations. Some computations, however, exhibit both regular and irregular aspects. For such computations, a better programming model is one that i...

  3. Parallel Pascal - An extended Pascal for parallel computers (United States)

    Reeves, A. P.


    Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

  4. Massively Parallel Genetics. (United States)

    Shendure, Jay; Fields, Stanley


    Human genetics has historically depended on the identification of individuals whose natural genetic variation underlies an observable trait or disease risk. Here we argue that new technologies now augment this historical approach by allowing the use of massively parallel assays in model systems to measure the functional effects of genetic variation in many human genes. These studies will help establish the disease risk of both observed and potential genetic variants and to overcome the problem of "variants of uncertain significance." Copyright © 2016 by the Genetics Society of America.

  5. Parallel sphere rendering

    Energy Technology Data Exchange (ETDEWEB)

    Krogh, M.; Hansen, C.; Painter, J. [Los Alamos National Lab., NM (United States); de Verdiere, G.C. [CEA Centre d`Etudes de Limeil, 94 - Villeneuve-Saint-Georges (France)


    Sphere rendering is an important method for visualizing molecular dynamics data. This paper presents a parallel divide-and-conquer algorithm that is almost 90 times faster than current graphics workstations. To render extremely large data sets and large images, the algorithm uses the MIMD features of the supercomputers to divide up the data, render independent partial images, and then finally composite the multiple partial images using an optimal method. The algorithm and performance results are presented for the CM-5 and the T3D.

  6. Parallel Repetition From Fortification


    Moshkovitz Aaronson, Dana Hadar


    The Parallel Repetition Theorem upper-bounds the value of a repeated (tensored) two prover game in terms of the value of the base game and the number of repetitions. In this work we give a simple transformation on games – “fortification” – and show that for fortified games, the value of the repeated game decreases perfectly exponentially with the number of repetitions, up to an arbitrarily small additive error. Our proof is combinatorial and short. As corollaries, we obtain: (1) Starting from...

  7. Parallelized direct execution simulation of message-passing parallel programs (United States)

    Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.


    As massively parallel computers proliferate, there is growing interest in findings ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, Large Application Parallel Simulation Environment (LAPSE), we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

  8. SciSpark: Highly Interactive and Scalable Model Evaluation and Climate Metrics (United States)

    Wilson, B. D.; Palamuttam, R. S.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; Verma, R.; Waliser, D. E.; Lee, H.


    Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We are developing a lightning fast Big Data technology called SciSpark based on ApacheTM Spark under a NASA AIST grant (PI Mattmann). Spark implements the map-reduce paradigm for parallel computing on a cluster, but emphasizes in-memory computation, "spilling" to disk only as needed, and so outperforms the disk-based ApacheTM Hadoop by 100x in memory and by 10x on disk. SciSpark will enable scalable model evaluation by executing large-scale comparisons of A-Train satellite observations to model grids on a cluster of 10 to 1000 compute nodes. This 2nd generation capability for NASA's Regional Climate Model Evaluation System (RCMES) will compute simple climate metrics at interactive speeds, and extend to quite sophisticated iterative algorithms such as machine-learning based clustering of temperature PDFs, and even graph-based algorithms for searching for Mesocale Convective Complexes. We have implemented a parallel data ingest capability in which the user specifies desired variables (arrays) as several time-sorted lists of URL's (i.e. using OPeNDAP, or local files). The specified variables are partitioned by time/space and then each Spark node pulls its bundle of arrays into memory to begin a computation pipeline. We also investigated the performance of several N-dim. array libraries (scala breeze, java jblas & netlib-java, and ND4J). We are currently developing science codes using ND4J and studying memory behavior on the JVM. On the pyspark side, many of our science codes already use the numpy and SciPy ecosystems. The talk will cover: the architecture of SciSpark, the design of the scientific RDD (sRDD) data structure, our

  9. Scalability of DL_POLY on High Performance Computing Platform

    Directory of Open Access Journals (Sweden)

    Mabule Samuel Mabakane


    Full Text Available This paper presents a case study on the scalability of several versions of the molecular dynamics code (DL_POLY performed on South Africa‘s Centre for High Performance Computing e1350 IBM Linux cluster, Sun system and Lengau supercomputers. Within this study different problem sizes were designed and the same chosen systems were employed in order to test the performance of DL_POLY using weak and strong scalability. It was found that the speed-up results for the small systems were better than large systems on both Ethernet and Infiniband network. However, simulations of large systems in DL_POLY performed well using Infiniband network on Lengau cluster as compared to e1350 and Sun supercomputer.

  10. Scalable quantum memory in the ultrastrong coupling regime. (United States)

    Kyaw, T H; Felicetti, S; Romero, G; Solano, E; Kwek, L-C


    Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances.

  11. On the Scalability of Time-predictable Chip-Multiprocessing

    DEFF Research Database (Denmark)

    Puffitsch, Wolfgang; Schoeberl, Martin


    Real-time systems need a time-predictable execution platform to be able to determine the worst-case execution time statically. In order to be time-predictable, several advanced processor features, such as out-of-order execution and other forms of speculation, have to be avoided. However, just using...... simple processors is not an option for embedded systems with high demands on computing power. In order to provide high performance and predictability we argue to use multiprocessor systems with a time-predictable memory interface. In this paper we present the scalability of a Java chip......-multiprocessor system that is designed to be time-predictable. Adding time-predictable caches is mandatory to achieve scalability with a shared memory multi-processor system. As Java bytecode retains information about the nature of memory accesses, it is possible to implement a memory hierarchy that takes...

  12. Semantic Models for Scalable Search in the Internet of Things

    Directory of Open Access Journals (Sweden)

    Dennis Pfisterer


    Full Text Available The Internet of Things is anticipated to connect billions of embedded devices equipped with sensors to perceive their surroundings. Thereby, the state of the real world will be available online and in real-time and can be combined with other data and services in the Internet to realize novel applications such as Smart Cities, Smart Grids, or Smart Healthcare. This requires an open representation of sensor data and scalable search over data from diverse sources including sensors. In this paper we show how the Semantic Web technologies RDF (an open semantic data format and SPARQL (a query language for RDF-encoded data can be used to address those challenges. In particular, we describe how prediction models can be employed for scalable sensor search, how these prediction models can be encoded as RDF, and how the models can be queried by means of SPARQL.

  13. Continuity-Aware Scheduling Algorithm for Scalable Video Streaming

    Directory of Open Access Journals (Sweden)

    Atinat Palawan


    Full Text Available The consumer demand for retrieving and delivering visual content through consumer electronic devices has increased rapidly in recent years. The quality of video in packet networks is susceptible to certain traffic characteristics: average bandwidth availability, loss, delay and delay variation (jitter. This paper presents a scheduling algorithm that modifies the stream of scalable video to combat jitter. The algorithm provides unequal look-ahead by safeguarding the base layer (without the need for overhead of the scalable video. The results of the experiments show that our scheduling algorithm reduces the number of frames with a violated deadline and significantly improves the continuity of the video stream without compromising the average Y Peek Signal-to-Noise Ratio (PSNR.

  14. Scientific visualization uncertainty, multifield, biomedical, and scalable visualization

    CERN Document Server

    Chen, Min; Johnson, Christopher; Kaufman, Arie; Hagen, Hans


    Based on the seminar that took place in Dagstuhl, Germany in June 2011, this contributed volume studies the four important topics within the scientific visualization field: uncertainty visualization, multifield visualization, biomedical visualization and scalable visualization. • Uncertainty visualization deals with uncertain data from simulations or sampled data, uncertainty due to the mathematical processes operating on the data, and uncertainty in the visual representation, • Multifield visualization addresses the need to depict multiple data at individual locations and the combination of multiple datasets, • Biomedical is a vast field with select subtopics addressed from scanning methodologies to structural applications to biological applications, • Scalability in scientific visualization is critical as data grows and computational devices range from hand-held mobile devices to exascale computational platforms. Scientific Visualization will be useful to practitioners of scientific visualization, ...

  15. A Parallel Butterfly Algorithm

    KAUST Repository

    Poulson, Jack


    The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(Nd) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r2Nd logN). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms, and an analogue of a three-dimensional generalized Radon transform were, respectively, observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.

  16. Fast parallel event reconstruction

    CERN Document Server

    CERN. Geneva


    On-line processing of large data volumes produced in modern HEP experiments requires using maximum capabilities of modern and future many-core CPU and GPU architectures.One of such powerful feature is a SIMD instruction set, which allows packing several data items in one register and to operate on all of them, thus achievingmore operations per clock cycle. Motivated by the idea of using the SIMD unit ofmodern processors, the KF based track fit has been adapted for parallelism, including memory optimization, numerical analysis, vectorization with inline operator overloading, and optimization using SDKs. The speed of the algorithm has been increased in 120000 times with 0.1 ms/track, running in parallel on 16 SPEs of a Cell Blade computer.  Running on a Nehalem CPU with 8 cores it shows the processing speed of 52 ns/track using the Intel Threading Building Blocks. The same KF algorithm running on an Nvidia GTX 280 in the CUDA frameworkprovi...

  17. Theory of Parallel Mechanisms

    CERN Document Server

    Huang, Zhen; Ding, Huafeng


    This book contains mechanism analysis and synthesis. In mechanism analysis, a mobility methodology is first systematically presented. This methodology, based on the author's screw theory, proposed in 1997, of which the generality and validity was only proved recently,  is a very complex issue, researched by various scientists over the last 150 years. The principle of kinematic influence coefficient and its latest developments are described. This principle is suitable for kinematic analysis of various 6-DOF and lower-mobility parallel manipulators. The singularities are classified by a new point of view, and progress in position-singularity and orientation-singularity is stated. In addition, the concept of over-determinate input is proposed and a new method of force analysis based on screw theory is presented. In mechanism synthesis, the synthesis for spatial parallel mechanisms is discussed, and the synthesis method of difficult 4-DOF and 5-DOF symmetric mechanisms, which was first put forward by the a...

  18. Massively Parallel QCD

    Energy Technology Data Exchange (ETDEWEB)

    Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G


    The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results.

  19. Economical and scalable synthesis of 6-amino-2-cyanobenzothiazole

    Directory of Open Access Journals (Sweden)

    Jacob R. Hauser


    Full Text Available 2-Cyanobenzothiazoles (CBTs are useful building blocks for: 1 luciferin derivatives for bioluminescent imaging; and 2 handles for bioorthogonal ligations. A particularly versatile CBT is 6-amino-2-cyanobenzothiazole (ACBT, which has an amine handle for straight-forward derivatisation. Here we present an economical and scalable synthesis of ACBT based on a cyanation catalysed by 1,4-diazabicyclo[2.2.2]octane (DABCO, and discuss its advantages for scale-up over previously reported routes.

  20. Point based graphics rendering with unified scalability solutions.


    Bull, L.


    Standard real-time 3D graphics rendering algorithms use brute force polygon rendering, with complexity linear in the number of polygons and little regard for limiting processing to data that contributes to the image. Modern hardware can now render smaller scenes to pixel levels of detail, relaxing surface connectivity requirements. Sub-linear scalability optimizations are typically self-contained, requiring specific data structures, without shared functions and data. A new point based renderi...

  1. Fast & scalable pattern transfer via block copolymer nanolithography

    DEFF Research Database (Denmark)

    Li, Tao; Wang, Zhongli; Schulte, Lars


    A fully scalable and efficient pattern transfer process based on block copolymer (BCP) self-assembling directly on various substrates is demonstrated. PS-rich and PDMS-rich poly(styrene-b-dimethylsiloxane) (PS-b-PDMS) copolymers are used to give monolayer sphere morphology after spin-casting of s...... on long range lateral order, including fabrication of substrates for catalysis, solar cells, sensors, ultrafiltration membranes and templating of semiconductors or metals....

  2. Interactive and Animated Scalable Vector Graphics and R Data Displays


    Deborah Nolan; Duncan Temple Lang


    We describe an approach to creating interactive and animated graphical displays using R's graphics engine and Scalable Vector Graphics, an XML vocabulary for describing two-dimensional graphical displays. We use the svg() graphics device in R and then post-process the resulting XML documents. The post-processing identities the elements in the SVG that correspond to the different components of the graphical display, e.g., points, axes, labels, lines. One can then annotate these elements to add...

  3. A Dwarf-based Scalable Big Data Benchmarking Methodology


    Gao, Wanling; Wang, Lei; Zhan, Jianfeng; Luo, Chunjie; Zheng, Daoyi; Jia, Zhen; Xie, Biwei; Zheng, Chen; Yang, Qiang; Wang, Haibin


    Different from the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload, this paper presents a scalable big data benchmarking methodology. Among a wide variety of big data analytics workloads, we identify eight big data dwarfs, each of which captures the common requirements of each class of unit of computation while being reasonably divorced from individual implementations. We implement the eight dwarfs on different software stacks, e.g., Open...

  4. A Robust, Scalable Framework for Conducting Climate Change Susceptibility Analyses (United States)


    Climate change and extreme weather events: implications for food production, plant diseases, and pests . Global Change and Human Health 2:90–104. ERDC/EL...Approved for public release; distribution is unlimited. ERDC/EL TN-14-1 May 2014 A Robust, Scalable Framework for Conducting Climate Change ...consider climate change during their planning processes as future landscapes have the potential to vary greatly from current conditions. Military

  5. Functional scalability through generative representations: the evolution of table designs


    Gregory S Hornby


    One of the main limitations for the functional scalability of automated design systems is the representation used for encoding designs. I argue that generative representations , those which are capable of reusing elements of the encoded design in the translation to the actual artifact, are better suited for automated design because reuse of building blocks captures some design dependencies and improves the ability to make large changes in design space. To support this argument I compare a gen...

  6. Final Scientific Report: A Scalable Development Environment for Peta-Scale Computing

    Energy Technology Data Exchange (ETDEWEB)

    Karbach, Carsten [Julich Research Center (Germany); Frings, Wolfgang [Julich Research Center (Germany)


    This document is the final scientific report of the project DE-SC000120 (A scalable Development Environment for Peta-Scale Computing). The objective of this project is the extension of the Parallel Tools Platform (PTP) for applying it to peta-scale systems. PTP is an integrated development environment for parallel applications. It comprises code analysis, performance tuning, parallel debugging and system monitoring. The contribution of the Juelich Supercomputing Centre (JSC) aims to provide a scalable solution for system monitoring of supercomputers. This includes the development of a new communication protocol for exchanging status data between the target remote system and the client running PTP. The communication has to work for high latency. PTP needs to be implemented robustly and should hide the complexity of the supercomputer's architecture in order to provide a transparent access to various remote systems via a uniform user interface. This simplifies the porting of applications to different systems, because PTP functions as abstraction layer between parallel application developer and compute resources. The common requirement for all PTP components is that they have to interact with the remote supercomputer. E.g. applications are built remotely and performance tools are attached to job submissions and their output data resides on the remote system. Status data has to be collected by evaluating outputs of the remote job scheduler and the parallel debugger needs to control an application executed on the supercomputer. The challenge is to provide this functionality for peta-scale systems in real-time. The client server architecture of the established monitoring application LLview, developed by the JSC, can be applied to PTP's system monitoring. LLview provides a well-arranged overview of the supercomputer's current status. A set of statistics, a list of running and queued jobs as well as a node display mapping running jobs to their compute

  7. Scalable Coverage Maintenance for Dense Wireless Sensor Networks

    Directory of Open Access Journals (Sweden)

    Jun Lu


    Full Text Available Owing to numerous potential applications, wireless sensor networks have been attracting significant research effort recently. The critical challenge that wireless sensor networks often face is to sustain long-term operation on limited battery energy. Coverage maintenance schemes can effectively prolong network lifetime by selecting and employing a subset of sensors in the network to provide sufficient sensing coverage over a target region. We envision future wireless sensor networks composed of a vast number of miniaturized sensors in exceedingly high density. Therefore, the key issue of coverage maintenance for future sensor networks is the scalability to sensor deployment density. In this paper, we propose a novel coverage maintenance scheme, scalable coverage maintenance (SCOM, which is scalable to sensor deployment density in terms of communication overhead (i.e., number of transmitted and received beacons and computational complexity (i.e., time and space complexity. In addition, SCOM achieves high energy efficiency and load balancing over different sensors. We have validated our claims through both analysis and simulations.

  8. Scalable Video Coding with Interlayer Signal Decorrelation Techniques

    Directory of Open Access Journals (Sweden)

    Yang Wenxian


    Full Text Available Scalability is one of the essential requirements in the compression of visual data for present-day multimedia communications and storage. The basic building block for providing the spatial scalability in the scalable video coding (SVC standard is the well-known Laplacian pyramid (LP. An LP achieves the multiscale representation of the video as a base-layer signal at lower resolution together with several enhancement-layer signals at successive higher resolutions. In this paper, we propose to improve the coding performance of the enhancement layers through efficient interlayer decorrelation techniques. We first show that, with nonbiorthogonal upsampling and downsampling filters, the base layer and the enhancement layers are correlated. We investigate two structures to reduce this correlation. The first structure updates the base-layer signal by subtracting from it the low-frequency component of the enhancement layer signal. The second structure modifies the prediction in order that the low-frequency component in the new enhancement layer is diminished. The second structure is integrated in the JSVM 4.0 codec with suitable modifications in the prediction modes. Experimental results with some standard test sequences demonstrate coding gains up to 1 dB for I pictures and up to 0.7 dB for both I and P pictures.

  9. Scalable force directed graph layout algorithms using fast multipole methods

    KAUST Repository

    Yunis, Enas Abdulrahman


    We present an extension to ExaFMM, a Fast Multipole Method library, as a generalized approach for fast and scalable execution of the Force-Directed Graph Layout algorithm. The Force-Directed Graph Layout algorithm is a physics-based approach to graph layout that treats the vertices V as repelling charged particles with the edges E connecting them acting as springs. Traditionally, the amount of work required in applying the Force-Directed Graph Layout algorithm is O(|V|2 + |E|) using direct calculations and O(|V| log |V| + |E|) using truncation, filtering, and/or multi-level techniques. Correct application of the Fast Multipole Method allows us to maintain a lower complexity of O(|V| + |E|) while regaining most of the precision lost in other techniques. Solving layout problems for truly large graphs with millions of vertices still requires a scalable algorithm and implementation. We have been able to leverage the scalability and architectural adaptability of the ExaFMM library to create a Force-Directed Graph Layout implementation that runs efficiently on distributed multicore and multi-GPU architectures. © 2012 IEEE.

  10. Fiscal 1999 achievement report on regional consortium research and development project. Regional consortium on energy research and development in its 3rd year (Research for development of task oriented multiple transfer robot system TRIPTERS); 1999 nendo task tekigogata gunkosei hanso robot system TRIPTERS no kaihatsu kenkyu seika hokokusho. 3

    Energy Technology Data Exchange (ETDEWEB)



    Efforts to develop a task oriented multiple transfer robot system which will promptly and flexibly respond to changes in the task of transfer are reported. In the development of a positioning module, a dead-reckoning system (gyro, wheel encoder, etc.) is incorporated into a laser-aided positioning system for improvement in precision, and the result is tested and evaluated. In the development of a platoon transfer and cooperative transfer module, the method of pursuing the tracks of the preceding vehicle is improved for higher precision, and the resultant transfer module is tested and evaluated. A transfer management module is tested and evaluated in a transfer control simulation in which a communication system and a transfer robot are integrated. In the development of environmental state recognition and obstacle avoidance technologies, a real-time visual system is completed in which a wide-angle camera detects color-marked objects and a stereographic camera measures the range, and is subjected to performance evaluation. Studies are also made about an environmental state recognition module, the control of obstacle avoiding run, and the behavior of avoidance. A transfer mode and cooperative transfer in which plural standard robots are combined is also studied. (NEDO)

  11. Repetition and Diversification in Multi-Session Task Oriented Search (United States)

    Tyler, Sarah K.


    As the number of documents and the availability of information online grows, so to can the difficulty in sifting through documents to find what we're searching for. Traditional Information Retrieval (IR) systems consider the query as the representation of the user's needs, and as such are limited to the user's ability to describe the information…

  12. Knowledge Pipeline: A Task Oriented Way to Implement Knowledge Management

    International Nuclear Information System (INIS)

    Pan Jiajie


    Concept of knowledge pipeline: There are many pipelines named by tasks or business processes in an organization. Knowledge contributors put knowledge to its corresponding pipelines. A maintenance team could keep the knowledge in pipelines clear and valid. Users could get knowledge just like opening a faucet in terms of their tasks or business processes



    34Understanding human intentions via hidden markov models in autonomous mobile robots," in Proceedings of the 3rd ACM/IEEE international conference... Ergonomic Technical Advisory Group Conference, Langley, VA. Poore, J.C. (2016). Embedding Human-System Integration Metrics in Agile Software

  14. Educational Evaluators--A Model for Task Oriented Position Development (United States)

    Rice, David; and others


    An outline of 44 evaluator tasks is discussed in terms of its usefulness in defining, evaluating, and improving the position of "educational evaluator ; in adapting the position to the needs of particular institutions; and in designing appropriate evaluator training programs. (JES)

  15. Parallel Polarization State Generation. (United States)

    She, Alan; Capasso, Federico


    The control of polarization, an essential property of light, is of wide scientific and technological interest. The general problem of generating arbitrary time-varying states of polarization (SOP) has always been mathematically formulated by a series of linear transformations, i.e. a product of matrices, imposing a serial architecture. Here we show a parallel architecture described by a sum of matrices. The theory is experimentally demonstrated by modulating spatially-separated polarization components of a laser using a digital micromirror device that are subsequently beam combined. This method greatly expands the parameter space for engineering devices that control polarization. Consequently, performance characteristics, such as speed, stability, and spectral range, are entirely dictated by the technologies of optical intensity modulation, including absorption, reflection, emission, and scattering. This opens up important prospects for polarization state generation (PSG) with unique performance characteristics with applications in spectroscopic ellipsometry, spectropolarimetry, communications, imaging, and security.

  16. Parallel imaging microfluidic cytometer. (United States)

    Ehrlich, Daniel J; McKenna, Brian K; Evans, James G; Belkina, Anna C; Denis, Gerald V; Sherr, David H; Cheung, Man Ching


    By adding an additional degree of freedom from multichannel flow, the parallel microfluidic cytometer (PMC) combines some of the best features of fluorescence-activated flow cytometry (FCM) and microscope-based high-content screening (HCS). The PMC (i) lends itself to fast processing of large numbers of samples, (ii) adds a 1D imaging capability for intracellular localization assays (HCS), (iii) has a high rare-cell sensitivity, and (iv) has an unusual capability for time-synchronized sampling. An inability to practically handle large sample numbers has restricted applications of conventional flow cytometers and microscopes in combinatorial cell assays, network biology, and drug discovery. The PMC promises to relieve a bottleneck in these previously constrained applications. The PMC may also be a powerful tool for finding rare primary cells in the clinic. The multichannel architecture of current PMC prototypes allows 384 unique samples for a cell-based screen to be read out in ∼6-10 min, about 30 times the speed of most current FCM systems. In 1D intracellular imaging, the PMC can obtain protein localization using HCS marker strategies at many times for the sample throughput of charge-coupled device (CCD)-based microscopes or CCD-based single-channel flow cytometers. The PMC also permits the signal integration time to be varied over a larger range than is practical in conventional flow cytometers. The signal-to-noise advantages are useful, for example, in counting rare positive cells in the most difficult early stages of genome-wide screening. We review the status of parallel microfluidic cytometry and discuss some of the directions the new technology may take. Copyright © 2011 Elsevier Inc. All rights reserved.

  17. About Parallel Programming: Paradigms, Parallel Execution and Collaborative Systems

    Directory of Open Access Journals (Sweden)

    Loredana MOCEAN


    Full Text Available In the last years, there were made efforts for delineation of a stabile and unitary frame, where the problems of logical parallel processing must find solutions at least at the level of imperative languages. The results obtained by now are not at the level of the made efforts. This paper wants to be a little contribution at these efforts. We propose an overview in parallel programming, parallel execution and collaborative systems.

  18. Parallel computation with molecular-motor-propelled agents in nanofabricated networks. (United States)

    Nicolau, Dan V; Lard, Mercy; Korten, Till; van Delft, Falco C M J M; Persson, Malin; Bengtsson, Elina; Månsson, Alf; Diez, Stefan; Linke, Heiner; Nicolau, Dan V


    The combinatorial nature of many important mathematical problems, including nondeterministic-polynomial-time (NP)-complete problems, places a severe limitation on the problem size that can be solved with conventional, sequentially operating electronic computers. There have been significant efforts in conceiving parallel-computation approaches in the past, for example: DNA computation, quantum computation, and microfluidics-based computation. However, these approaches have not proven, so far, to be scalable and practical from a fabrication and operational perspective. Here, we report the foundations of an alternative parallel-computation system in which a given combinatorial problem is encoded into a graphical, modular network that is embedded in a nanofabricated planar device. Exploring the network in a parallel fashion using a large number of independent, molecular-motor-propelled agents then solves the mathematical problem. This approach uses orders of magnitude less energy than conventional computers, thus addressing issues related to power consumption and heat dissipation. We provide a proof-of-concept demonstration of such a device by solving, in a parallel fashion, the small instance {2, 5, 9} of the subset sum problem, which is a benchmark NP-complete problem. Finally, we discuss the technical advances necessary to make our system scalable with presently available technology.

  19. CaKernel – A Parallel Application Programming Framework for Heterogenous Computing Architectures

    Directory of Open Access Journals (Sweden)

    Marek Blazewicz


    Full Text Available With the recent advent of new heterogeneous computing architectures there is still a lack of parallel problem solving environments that can help scientists to use easily and efficiently hybrid supercomputers. Many scientific simulations that use structured grids to solve partial differential equations in fact rely on stencil computations. Stencil computations have become crucial in solving many challenging problems in various domains, e.g., engineering or physics. Although many parallel stencil computing approaches have been proposed, in most cases they solve only particular problems. As a result, scientists are struggling when it comes to the subject of implementing a new stencil-based simulation, especially on high performance hybrid supercomputers. In response to the presented need we extend our previous work on a parallel programming framework for CUDA – CaCUDA that now supports OpenCL. We present CaKernel – a tool that simplifies the development of parallel scientific applications on hybrid systems. CaKernel is built on the highly scalable and portable Cactus framework. In the CaKernel framework, Cactus manages the inter-process communication via MPI while CaKernel manages the code running on Graphics Processing Units (GPUs and interactions between them. As a non-trivial test case we have developed a 3D CFD code to demonstrate the performance and scalability of the automatically generated code.

  20. Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting

    International Nuclear Information System (INIS)

    Azad, Ariful; Buluc, Aydn; Pothen, Alex


    It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting path is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.