Parallel scalability of Hartree-Fock calculations
Chow, Edmond; Liu, Xing; Smelyanskiy, Mikhail; Hammond, Jeff R.
2015-03-01
Quantum chemistry is increasingly performed using large cluster computers consisting of multiple interconnected nodes. For a fixed molecular problem, the efficiency of a calculation usually decreases as more nodes are used, due to the cost of communication between the nodes. This paper empirically investigates the parallel scalability of Hartree-Fock calculations. The construction of the Fock matrix and the density matrix calculation are analyzed separately. For the former, we use a parallelization of Fock matrix construction based on a static partitioning of work followed by a work stealing phase. For the latter, we use density matrix purification from the linear scaling methods literature, but without using sparsity. When using large numbers of nodes for moderately sized problems, density matrix computations are network-bandwidth bound, making purification methods potentially faster than eigendecomposition methods.
Parallelism and Scalability in an Image Processing Application
DEFF Research Database (Denmark)
Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven
2008-01-01
parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chip. This means that parallel processing is required in application areas that traditionally have not used...
Parallelism and Scalability in an Image Processing Application
DEFF Research Database (Denmark)
Rasmussen, Morten Sleth; Stuart, Matthias Bo; Karlsson, Sven
2009-01-01
parallel programs. This paper investigates parallelism and scalability of an embedded image processing application. The major challenges faced when parallelizing the application were to extract enough parallelism from the application and to reduce load imbalance. The application has limited immediately......The recent trends in processor architecture show that parallel processing is moving into new areas of computing in the form of many-core desktop processors and multi-processor system-on-chips. This means that parallel processing is required in application areas that traditionally have not used...
Compiling Scientific Programs for Scalable Parallel Systems
National Research Council Canada - National Science Library
Kennedy, Ken
2001-01-01
...). The research performed in this project included new techniques for recognizing implicit parallelism in sequential programs, a powerful and precise set-based framework for analysis and transformation...
Scalable parallel prefix solvers for discrete ordinates transport
International Nuclear Information System (INIS)
Pautz, S.; Pandya, T.; Adams, M.
2009-01-01
The well-known 'sweep' algorithm for inverting the streaming-plus-collision term in first-order deterministic radiation transport calculations has some desirable numerical properties. However, it suffers from parallel scaling issues caused by a lack of concurrency. The maximum degree of concurrency, and thus the maximum parallelism, grows more slowly than the problem size for sweeps-based solvers. We investigate a new class of parallel algorithms that involves recasting the streaming-plus-collision problem in prefix form and solving via cyclic reduction. This method, although computationally more expensive at low levels of parallelism than the sweep algorithm, offers better theoretical scalability properties. Previous work has demonstrated this approach for one-dimensional calculations; we show how to extend it to multidimensional calculations. Notably, for multiple dimensions it appears that this approach is limited to long-characteristics discretizations; other discretizations cannot be cast in prefix form. We implement two variants of the algorithm within the radlib/SCEPTRE transport code library at Sandia National Laboratories and show results on two different massively parallel systems. Both the 'forward' and 'symmetric' solvers behave similarly, scaling well to larger degrees of parallelism then sweeps-based solvers. We do observe some issues at the highest levels of parallelism (relative to the system size) and discuss possible causes. We conclude that this approach shows good potential for future parallel systems, but the parallel scalability will depend heavily on the architecture of the communication networks of these systems. (authors)
pcircle - A Suite of Scalable Parallel File System Tools
Energy Technology Data Exchange (ETDEWEB)
2015-10-01
Most of the software related to file system are written for conventional local file system, they are serialized and can't take advantage of the benefit of a large scale parallel file system. "pcircle" software builds on top of ubiquitous MPI in cluster computing environment and "work-stealing" pattern to provide a scalable, high-performance suite of file system tools. In particular - it implemented parallel data copy and parallel data checksumming, with advanced features such as async progress report, checkpoint and restart, as well as integrity checking.
A scalable parallel algorithm for multiple objective linear programs
Wiecek, Malgorzata M.; Zhang, Hong
1994-01-01
This paper presents an ADBASE-based parallel algorithm for solving multiple objective linear programs (MOLP's). Job balance, speedup and scalability are of primary interest in evaluating efficiency of the new algorithm. Implementation results on Intel iPSC/2 and Paragon multiprocessors show that the algorithm significantly speeds up the process of solving MOLP's, which is understood as generating all or some efficient extreme points and unbounded efficient edges. The algorithm gives specially good results for large and very large problems. Motivation and justification for solving such large MOLP's are also included.
Final Report: Center for Programming Models for Scalable Parallel Computing
Energy Technology Data Exchange (ETDEWEB)
Mellor-Crummey, John [William Marsh Rice University
2011-09-13
As part of the Center for Programming Models for Scalable Parallel Computing, Rice University collaborated with project partners in the design, development and deployment of language, compiler, and runtime support for parallel programming models to support application development for the “leadership-class” computer systems at DOE national laboratories. Work over the course of this project has focused on the design, implementation, and evaluation of a second-generation version of Coarray Fortran. Research and development efforts of the project have focused on the CAF 2.0 language, compiler, runtime system, and supporting infrastructure. This has involved working with the teams that provide infrastructure for CAF that we rely on, implementing new language and runtime features, producing an open source compiler that enabled us to evaluate our ideas, and evaluating our design and implementation through the use of benchmarks. The report details the research, development, findings, and conclusions from this work.
A scalable implementation of RI-SCF on parallel computers
International Nuclear Information System (INIS)
Fruechtl, H.A.; Kendall, R.A.; Harrison, R.J.
1996-01-01
In order to avoid the integral bottleneck of conventional SCF calculations, the Resolution of the Identity (RI) method is used to obtain an approximate solution to the Hartree-Fock equations. In this approximation only three-center integrals are needed to build the Fock matrix. It has been implemented as part of the NWChem package of portable and scalable ab initio programs for parallel computers. Utilizing the V-approximation, both the Coulomb and exchange contribution to the Fock matrix can be calculated from a transformed set of three-center integrals which have to be precalculated and stored. A distributed in-core method as well as a disk based implementation have been programmed. Details of the implementation as well as the parallel programming tools used are described. We also give results and timings from benchmark calculations
Highly scalable parallel processing of extracellular recordings of Multielectrode Arrays.
Gehring, Tiago V; Vasilaki, Eleni; Giugliano, Michele
2015-01-01
Technological advances of Multielectrode Arrays (MEAs) used for multisite, parallel electrophysiological recordings, lead to an ever increasing amount of raw data being generated. Arrays with hundreds up to a few thousands of electrodes are slowly seeing widespread use and the expectation is that more sophisticated arrays will become available in the near future. In order to process the large data volumes resulting from MEA recordings there is a pressing need for new software tools able to process many data channels in parallel. Here we present a new tool for processing MEA data recordings that makes use of new programming paradigms and recent technology developments to unleash the power of modern highly parallel hardware, such as multi-core CPUs with vector instruction sets or GPGPUs. Our tool builds on and complements existing MEA data analysis packages. It shows high scalability and can be used to speed up some performance critical pre-processing steps such as data filtering and spike detection, helping to make the analysis of larger data sets tractable.
Parallel peak pruning for scalable SMP contour tree computation
Energy Technology Data Exchange (ETDEWEB)
Carr, Hamish A. [Univ. of Leeds (United Kingdom); Weber, Gunther H. [Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States); Univ. of California, Davis, CA (United States); Sewell, Christopher M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Ahrens, James P. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2017-03-09
As data sets grow to exascale, automated data analysis and visualisation are increasingly important, to intermediate human understanding and to reduce demands on disk storage via in situ analysis. Trends in architecture of high performance computing systems necessitate analysis algorithms to make effective use of combinations of massively multicore and distributed systems. One of the principal analytic tools is the contour tree, which analyses relationships between contours to identify features of more than local importance. Unfortunately, the predominant algorithms for computing the contour tree are explicitly serial, and founded on serial metaphors, which has limited the scalability of this form of analysis. While there is some work on distributed contour tree computation, and separately on hybrid GPU-CPU computation, there is no efficient algorithm with strong formal guarantees on performance allied with fast practical performance. Here in this paper, we report the first shared SMP algorithm for fully parallel contour tree computation, withfor-mal guarantees of O(lgnlgt) parallel steps and O(n lgn) work, and implementations with up to 10x parallel speed up in OpenMP and up to 50x speed up in NVIDIA Thrust.
GASPRNG: GPU accelerated scalable parallel random number generator library
Gao, Shuang; Peterson, Gregory D.
2013-04-01
Graphics processors represent a promising technology for accelerating computational science applications. Many computational science applications require fast and scalable random number generation with good statistical properties, so they use the Scalable Parallel Random Number Generators library (SPRNG). We present the GPU Accelerated SPRNG library (GASPRNG) to accelerate SPRNG in GPU-based high performance computing systems. GASPRNG includes code for a host CPU and CUDA code for execution on NVIDIA graphics processing units (GPUs) along with a programming interface to support various usage models for pseudorandom numbers and computational science applications executing on the CPU, GPU, or both. This paper describes the implementation approach used to produce high performance and also describes how to use the programming interface. The programming interface allows a user to be able to use GASPRNG the same way as SPRNG on traditional serial or parallel computers as well as to develop tightly coupled programs executing primarily on the GPU. We also describe how to install GASPRNG and use it. To help illustrate linking with GASPRNG, various demonstration codes are included for the different usage models. GASPRNG on a single GPU shows up to 280x speedup over SPRNG on a single CPU core and is able to scale for larger systems in the same manner as SPRNG. Because GASPRNG generates identical streams of pseudorandom numbers as SPRNG, users can be confident about the quality of GASPRNG for scalable computational science applications. Catalogue identifier: AEOI_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEOI_v1_0.html Program obtainable from: CPC Program Library, Queen’s University, Belfast, N. Ireland Licensing provisions: UTK license. No. of lines in distributed program, including test data, etc.: 167900 No. of bytes in distributed program, including test data, etc.: 1422058 Distribution format: tar.gz Programming language: C and CUDA. Computer: Any PC or
CX: A Scalable, Robust Network for Parallel Computing
Directory of Open Access Journals (Sweden)
Peter Cappello
2002-01-01
Full Text Available CX, a network-based computational exchange, is presented. The system's design integrates variations of ideas from other researchers, such as work stealing, non-blocking tasks, eager scheduling, and space-based coordination. The object-oriented API is simple, compact, and cleanly separates application logic from the logic that supports interprocess communication and fault tolerance. Computations, of course, run to completion in the presence of computational hosts that join and leave the ongoing computation. Such hosts, or producers, use task caching and prefetching to overlap computation with interprocessor communication. To break a potential task server bottleneck, a network of task servers is presented. Even though task servers are envisioned as reliable, the self-organizing, scalable network of n- servers, described as a sibling-connected height-balanced fat tree, tolerates a sequence of n-1 server failures. Tasks are distributed throughout the server network via a simple "diffusion" process. CX is intended as a test bed for research on automated silent auctions, reputation services, authentication services, and bonding services. CX also provides a test bed for algorithm research into network-based parallel computation.
Mapping robust parallel multigrid algorithms to scalable memory architectures
Overman, Andrea; Vanrosendale, John
1993-01-01
The convergence rate of standard multigrid algorithms degenerates on problems with stretched grids or anisotropic operators. The usual cure for this is the use of line or plane relaxation. However, multigrid algorithms based on line and plane relaxation have limited and awkward parallelism and are quite difficult to map effectively to highly parallel architectures. Newer multigrid algorithms that overcome anisotropy through the use of multiple coarse grids rather than relaxation are better suited to massively parallel architectures because they require only simple point-relaxation smoothers. In this paper, we look at the parallel implementation of a V-cycle multiple semicoarsened grid (MSG) algorithm on distributed-memory architectures such as the Intel iPSC/860 and Paragon computers. The MSG algorithms provide two levels of parallelism: parallelism within the relaxation or interpolation on each grid and across the grids on each multigrid level. Both levels of parallelism must be exploited to map these algorithms effectively to parallel architectures. This paper describes a mapping of an MSG algorithm to distributed-memory architectures that demonstrates how both levels of parallelism can be exploited. The result is a robust and effective multigrid algorithm for distributed-memory machines.
Scalability of Parallel Scientific Applications on the Cloud
Directory of Open Access Journals (Sweden)
Satish Narayana Srirama
2011-01-01
Full Text Available Cloud computing, with its promise of virtually infinite resources, seems to suit well in solving resource greedy scientific computing problems. To study the effects of moving parallel scientific applications onto the cloud, we deployed several benchmark applications like matrix–vector operations and NAS parallel benchmarks, and DOUG (Domain decomposition On Unstructured Grids on the cloud. DOUG is an open source software package for parallel iterative solution of very large sparse systems of linear equations. The detailed analysis of DOUG on the cloud showed that parallel applications benefit a lot and scale reasonable on the cloud. We could also observe the limitations of the cloud and its comparison with cluster in terms of performance. However, for efficiently running the scientific applications on the cloud infrastructure, the applications must be reduced to frameworks that can successfully exploit the cloud resources, like the MapReduce framework. Several iterative and embarrassingly parallel algorithms are reduced to the MapReduce model and their performance is measured and analyzed. The analysis showed that Hadoop MapReduce has significant problems with iterative methods, while it suits well for embarrassingly parallel algorithms. Scientific computing often uses iterative methods to solve large problems. Thus, for scientific computing on the cloud, this paper raises the necessity for better frameworks or optimizations for MapReduce.
On Scalable Deep Learning and Parallelizing Gradient Descent
AUTHOR|(CDS)2129036; Möckel, Rico; Baranowski, Zbigniew; Canali, Luca
Speeding up gradient based methods has been a subject of interest over the past years with many practical applications, especially with respect to Deep Learning. Despite the fact that many optimizations have been done on a hardware level, the convergence rate of very large models remains problematic. Therefore, data parallel methods next to mini-batch parallelism have been suggested to further decrease the training time of parameterized models using gradient based methods. Nevertheless, asynchronous optimization was considered too unstable for practical purposes due to a lacking understanding of the underlying mechanisms. Recently, a theoretical contribution has been made which defines asynchronous optimization in terms of (implicit) momentum due to the presence of a queuing model of gradients based on past parameterizations. This thesis mainly builds upon this work to construct a better understanding why asynchronous optimization shows proportionally more divergent behavior when the number of parallel worker...
Luke, Edward Allen
1993-01-01
Two algorithms capable of computing a transonic 3-D inviscid flow field about rotating machines are considered for parallel implementation. During the study of these algorithms, a significant new method of measuring the performance of parallel algorithms is developed. The theory that supports this new method creates an empirical definition of scalable parallel algorithms that is used to produce quantifiable evidence that a scalable parallel application was developed. The implementation of the parallel application and an automated domain decomposition tool are also discussed.
A scalable method for parallelizing sampling-based motion planning algorithms
Jacobs, Sam Ade; Manavi, Kasra; Burgos, Juan; Denny, Jory; Thomas, Shawna; Amato, Nancy M.
2012-01-01
This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.
A scalable method for parallelizing sampling-based motion planning algorithms
Jacobs, Sam Ade
2012-05-01
This paper describes a scalable method for parallelizing sampling-based motion planning algorithms. It subdivides configuration space (C-space) into (possibly overlapping) regions and independently, in parallel, uses standard (sequential) sampling-based planners to construct roadmaps in each region. Next, in parallel, regional roadmaps in adjacent regions are connected to form a global roadmap. By subdividing the space and restricting the locality of connection attempts, we reduce the work and inter-processor communication associated with nearest neighbor calculation, a critical bottleneck for scalability in existing parallel motion planning methods. We show that our method is general enough to handle a variety of planning schemes, including the widely used Probabilistic Roadmap (PRM) and Rapidly-exploring Random Trees (RRT) algorithms. We compare our approach to two other existing parallel algorithms and demonstrate that our approach achieves better and more scalable performance. Our approach achieves almost linear scalability on a 2400 core LINUX cluster and on a 153,216 core Cray XE6 petascale machine. © 2012 IEEE.
A scalable PC-based parallel computer for lattice QCD
International Nuclear Information System (INIS)
Fodor, Z.; Katz, S.D.; Pappa, G.
2003-01-01
A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eoetvoes Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered (wilson) quarks on large lattices is around 70(110) GFlops. The exceptional price/performance ratio is below $1/Mflop
A scalable PC-based parallel computer for lattice QCD
International Nuclear Information System (INIS)
Fodor, Z.; Papp, G.
2002-09-01
A PC-based parallel computer for medium/large scale lattice QCD simulations is suggested. The Eoetvoes Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7 GHz nodes. Gigabit Ethernet cards are used for nearest neighbor communication in a two-dimensional mesh. The sustained performance for dynamical staggered(wilson) quarks on large lattices is around 70(110) GFlops. The exceptional price/performance ratio is below $1/Mflop. (orig.)
Using Python to Construct a Scalable Parallel Nonlinear Wave Solver
Mandli, Kyle
2011-01-01
Computational scientists seek to provide efficient, easy-to-use tools and frameworks that enable application scientists within a specific discipline to build and/or apply numerical models with up-to-date computing technologies that can be executed on all available computing systems. Although many tools could be useful for groups beyond a specific application, it is often difficult and time consuming to combine existing software, or to adapt it for a more general purpose. Python enables a high-level approach where a general framework can be supplemented with tools written for different fields and in different languages. This is particularly important when a large number of tools are necessary, as is the case for high performance scientific codes. This motivated our development of PetClaw, a scalable distributed-memory solver for time-dependent nonlinear wave propagation, as a case-study for how Python can be used as a highlevel framework leveraging a multitude of codes, efficient both in the reuse of code and programmer productivity. We present scaling results for computations on up to four racks of Shaheen, an IBM BlueGene/P supercomputer at King Abdullah University of Science and Technology. One particularly important issue that PetClaw has faced is the overhead associated with dynamic loading leading to catastrophic scaling. We use the walla library to solve the issue which does so by supplanting high-cost filesystem calls with MPI operations at a low enough level that developers may avoid any changes to their codes.
Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale
International Nuclear Information System (INIS)
Daily, Jeffrey A.
2015-01-01
The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore's law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or 'homologous') on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K
Scalable Parallel Methods for Analyzing Metagenomics Data at Extreme Scale
Energy Technology Data Exchange (ETDEWEB)
Daily, Jeffrey A. [Washington State Univ., Pullman, WA (United States)
2015-05-01
The field of bioinformatics and computational biology is currently experiencing a data revolution. The exciting prospect of making fundamental biological discoveries is fueling the rapid development and deployment of numerous cost-effective, high-throughput next-generation sequencing technologies. The result is that the DNA and protein sequence repositories are being bombarded with new sequence information. Databases are continuing to report a Moore’s law-like growth trajectory in their database sizes, roughly doubling every 18 months. In what seems to be a paradigm-shift, individual projects are now capable of generating billions of raw sequence data that need to be analyzed in the presence of already annotated sequence information. While it is clear that data-driven methods, such as sequencing homology detection, are becoming the mainstay in the field of computational life sciences, the algorithmic advancements essential for implementing complex data analytics at scale have mostly lagged behind. Sequence homology detection is central to a number of bioinformatics applications including genome sequencing and protein family characterization. Given millions of sequences, the goal is to identify all pairs of sequences that are highly similar (or “homologous”) on the basis of alignment criteria. While there are optimal alignment algorithms to compute pairwise homology, their deployment for large-scale is currently not feasible; instead, heuristic methods are used at the expense of quality. In this dissertation, we present the design and evaluation of a parallel implementation for conducting optimal homology detection on distributed memory supercomputers. Our approach uses a combination of techniques from asynchronous load balancing (viz. work stealing, dynamic task counters), data replication, and exact-matching filters to achieve homology detection at scale. Results for a collection of 2.56M sequences show parallel efficiencies of ~75-100% on up to 8K cores
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
International Nuclear Information System (INIS)
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten
2017-01-01
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach and paper, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. Finally, these two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
Scalable and fast heterogeneous molecular simulation with predictive parallelization schemes
Guzman, Horacio V.; Junghans, Christoph; Kremer, Kurt; Stuehn, Torsten
2017-11-01
Multiscale and inhomogeneous molecular systems are challenging topics in the field of molecular simulation. In particular, modeling biological systems in the context of multiscale simulations and exploring material properties are driving a permanent development of new simulation methods and optimization algorithms. In computational terms, those methods require parallelization schemes that make a productive use of computational resources for each simulation and from its genesis. Here, we introduce the heterogeneous domain decomposition approach, which is a combination of an heterogeneity-sensitive spatial domain decomposition with an a priori rearrangement of subdomain walls. Within this approach, the theoretical modeling and scaling laws for the force computation time are proposed and studied as a function of the number of particles and the spatial resolution ratio. We also show the new approach capabilities, by comparing it to both static domain decomposition algorithms and dynamic load-balancing schemes. Specifically, two representative molecular systems have been simulated and compared to the heterogeneous domain decomposition proposed in this work. These two systems comprise an adaptive resolution simulation of a biomolecule solvated in water and a phase-separated binary Lennard-Jones fluid.
A scalable approach to modeling groundwater flow on massively parallel computers
International Nuclear Information System (INIS)
Ashby, S.F.; Falgout, R.D.; Tompson, A.F.B.
1995-12-01
We describe a fully scalable approach to the simulation of groundwater flow on a hierarchy of computing platforms, ranging from workstations to massively parallel computers. Specifically, we advocate the use of scalable conceptual models in which the subsurface model is defined independently of the computational grid on which the simulation takes place. We also describe a scalable multigrid algorithm for computing the groundwater flow velocities. We axe thus able to leverage both the engineer's time spent developing the conceptual model and the computing resources used in the numerical simulation. We have successfully employed this approach at the LLNL site, where we have run simulations ranging in size from just a few thousand spatial zones (on workstations) to more than eight million spatial zones (on the CRAY T3D)-all using the same conceptual model
Tolba, Khaled Ibrahim; Morgenthal, Guido
2018-01-01
This paper presents an analysis of the scalability and efficiency of a simulation framework based on the vortex particle method. The code is applied for the numerical aerodynamic analysis of line-like structures. The numerical code runs on multicore CPU and GPU architectures using OpenCL framework. The focus of this paper is the analysis of the parallel efficiency and scalability of the method being applied to an engineering test case, specifically the aeroelastic response of a long-span bridge girder at the construction stage. The target is to assess the optimal configuration and the required computer architecture, such that it becomes feasible to efficiently utilise the method within the computational resources available for a regular engineering office. The simulations and the scalability analysis are performed on a regular gaming type computer.
Liu, Yang
2015-12-17
A scalable parallel plane-wave time-domain (PWTD) algorithm for efficient and accurate analysis of transient scattering from electrically large objects is presented. The algorithm produces scalable communication patterns on very large numbers of processors by leveraging two mechanisms: (i) a hierarchical parallelization strategy to evenly distribute the computation and memory loads at all levels of the PWTD tree among processors, and (ii) a novel asynchronous communication scheme to reduce the cost and memory requirement of the communications between the processors. The efficiency and accuracy of the algorithm are demonstrated through its applications to the analysis of transient scattering from a perfect electrically conducting (PEC) sphere with a diameter of 70 wavelengths and a PEC square plate with a dimension of 160 wavelengths. Furthermore, the proposed algorithm is used to analyze transient fields scattered from realistic airplane and helicopter models under high frequency excitation.
Yu, Leiming; Nina-Paravecino, Fanny; Kaeli, David; Fang, Qianqian
2018-01-01
We present a highly scalable Monte Carlo (MC) three-dimensional photon transport simulation platform designed for heterogeneous computing systems. Through the development of a massively parallel MC algorithm using the Open Computing Language framework, this research extends our existing graphics processing unit (GPU)-accelerated MC technique to a highly scalable vendor-independent heterogeneous computing environment, achieving significantly improved performance and software portability. A number of parallel computing techniques are investigated to achieve portable performance over a wide range of computing hardware. Furthermore, multiple thread-level and device-level load-balancing strategies are developed to obtain efficient simulations using multiple central processing units and GPUs. (2018) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE).
West, Jeff; Yang, H. Q.
2014-01-01
There are many instances involving liquid/gas interfaces and their dynamics in the design of liquid engine powered rockets such as the Space Launch System (SLS). Some examples of these applications are: Propellant tank draining and slosh, subcritical condition injector analysis for gas generators, preburners and thrust chambers, water deluge mitigation for launch induced environments and even solid rocket motor liquid slag dynamics. Commercially available CFD programs simulating gas/liquid interfaces using the Volume of Fluid approach are currently limited in their parallel scalability. In 2010 for instance, an internal NASA/MSFC review of three commercial tools revealed that parallel scalability was seriously compromised at 8 cpus and no additional speedup was possible after 32 cpus. Other non-interface CFD applications at the time were demonstrating useful parallel scalability up to 4,096 processors or more. Based on this review, NASA/MSFC initiated an effort to implement a Volume of Fluid implementation within the unstructured mesh, pressure-based algorithm CFD program, Loci-STREAM. After verification was achieved by comparing results to the commercial CFD program CFD-Ace+, and validation by direct comparison with data, Loci-STREAM-VoF is now the production CFD tool for propellant slosh force and slosh damping rate simulations at NASA/MSFC. On these applications, good parallel scalability has been demonstrated for problems sizes of tens of millions of cells and thousands of cpu cores. Ongoing efforts are focused on the application of Loci-STREAM-VoF to predict the transient flow patterns of water on the SLS Mobile Launch Platform in order to support the phasing of water for launch environment mitigation so that vehicle determinantal effects are not realized.
Scalable Parallel Distributed Coprocessor System for Graph Searching Problems with Massive Data
Directory of Open Access Journals (Sweden)
Wanrong Huang
2017-01-01
Full Text Available The Internet applications, such as network searching, electronic commerce, and modern medical applications, produce and process massive data. Considerable data parallelism exists in computation processes of data-intensive applications. A traversal algorithm, breadth-first search (BFS, is fundamental in many graph processing applications and metrics when a graph grows in scale. A variety of scientific programming methods have been proposed for accelerating and parallelizing BFS because of the poor temporal and spatial locality caused by inherent irregular memory access patterns. However, new parallel hardware could provide better improvement for scientific methods. To address small-world graph problems, we propose a scalable and novel field-programmable gate array-based heterogeneous multicore system for scientific programming. The core is multithread for streaming processing. And the communication network InfiniBand is adopted for scalability. We design a binary search algorithm to address mapping to unify all processor addresses. Within the limits permitted by the Graph500 test bench after 1D parallel hybrid BFS algorithm testing, our 8-core and 8-thread-per-core system achieved superior performance and efficiency compared with the prior work under the same degree of parallelism. Our system is efficient not as a special acceleration unit but as a processor platform that deals with graph searching applications.
Task oriented evaluation system for maintenance robots
International Nuclear Information System (INIS)
Asame, Hajime; Endo, Isao; Kotosaka, Shin-ya; Takata, Shozo; Hiraoka, Hiroyuki; Kohda, Takehisa; Matsumoto, Akihiro; Yamagishi, Kiichiro.
1994-01-01
The adaptability evaluation of maintenance robots to autonomous plants has been discussed. In this paper, a new concept of autonomous plant with maintenance robots are introduced, and a framework of autonomous maintenance system is proposed. Then, task-oriented evaluation of robot arms is discussed for evaluating their adaptability to maintenance tasks, and a new criterion called operability is proposed for adaptability evaluation. The task-oriented evaluation system is implemented and applied to structural design of robot arms. Using genetic algorithm, an optimal structure adaptable to a pump disassembly task is obtained. (author)
A highly scalable massively parallel fast marching method for the Eikonal equation
Yang, Jianming; Stern, Frederick
2017-03-01
The fast marching method is a widely used numerical method for solving the Eikonal equation arising from a variety of scientific and engineering fields. It is long deemed inherently sequential and an efficient parallel algorithm applicable to large-scale practical applications is not available in the literature. In this study, we present a highly scalable massively parallel implementation of the fast marching method using a domain decomposition approach. Central to this algorithm is a novel restarted narrow band approach that coordinates the frequency of communications and the amount of computations extra to a sequential run for achieving an unprecedented parallel performance. Within each restart, the narrow band fast marching method is executed; simple synchronous local exchanges and global reductions are adopted for communicating updated data in the overlapping regions between neighboring subdomains and getting the latest front status, respectively. The independence of front characteristics is exploited through special data structures and augmented status tags to extract the masked parallelism within the fast marching method. The efficiency, flexibility, and applicability of the parallel algorithm are demonstrated through several examples. These problems are extensively tested on six grids with up to 1 billion points using different numbers of processes ranging from 1 to 65536. Remarkable parallel speedups are achieved using tens of thousands of processes. Detailed pseudo-codes for both the sequential and parallel algorithms are provided to illustrate the simplicity of the parallel implementation and its similarity to the sequential narrow band fast marching algorithm.
Task-oriented maximally entangled states
International Nuclear Information System (INIS)
Agrawal, Pankaj; Pradhan, B
2010-01-01
We introduce the notion of a task-oriented maximally entangled state (TMES). This notion depends on the task for which a quantum state is used as the resource. TMESs are the states that can be used to carry out the task maximally. This concept may be more useful than that of a general maximally entangled state in the case of a multipartite system. We illustrate this idea by giving an operational definition of maximally entangled states on the basis of communication tasks of teleportation and superdense coding. We also give examples and a procedure to obtain such TMESs for n-qubit systems.
Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms
Fidel, Adam; Jacobs, Sam Ade; Sharma, Shishir; Amato, Nancy M.; Rauchwerger, Lawrence
2014-01-01
Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.
Using Load Balancing to Scalably Parallelize Sampling-Based Motion Planning Algorithms
Fidel, Adam
2014-05-01
Motion planning, which is the problem of computing feasible paths in an environment for a movable object, has applications in many domains ranging from robotics, to intelligent CAD, to protein folding. The best methods for solving this PSPACE-hard problem are so-called sampling-based planners. Recent work introduced uniform spatial subdivision techniques for parallelizing sampling-based motion planning algorithms that scaled well. However, such methods are prone to load imbalance, as planning time depends on region characteristics and, for most problems, the heterogeneity of the sub problems increases as the number of processors increases. In this work, we introduce two techniques to address load imbalance in the parallelization of sampling-based motion planning algorithms: an adaptive work stealing approach and bulk-synchronous redistribution. We show that applying these techniques to representatives of the two major classes of parallel sampling-based motion planning algorithms, probabilistic roadmaps and rapidly-exploring random trees, results in a more scalable and load-balanced computation on more than 3,000 cores. © 2014 IEEE.
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC
Directory of Open Access Journals (Sweden)
Yun-gang Xue
2017-01-01
Full Text Available We propose a highly parallel and scalable motion estimation algorithm, named multilevel resolution motion estimation (MLRME for short, by combining the advantages of local full search and downsampling. By subsampling a video frame, a large amount of computation is saved. While using the local full-search method, it can exploit massive parallelism and make full use of the powerful modern many-core accelerators, such as GPU and Intel Xeon Phi. We implanted the proposed MLRME into HM12.0, and the experimental results showed that the encoding quality of the MLRME method is close to that of the fast motion estimation in HEVC, which declines by less than 1.5%. We also implemented the MLRME with CUDA, which obtained 30–60x speed-up compared to the serial algorithm on single CPU. Specifically, the parallel implementation of MLRME on a GTX 460 GPU can meet the real-time coding requirement with about 25 fps for the 2560×1600 video format, while, for 832×480, the performance is more than 100 fps.
International Nuclear Information System (INIS)
Hoisie, A.; Lubeck, O.; Wasserman, H.
1998-01-01
The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. They validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100-TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. In this context, they analyze two problem sizes. The model shows that on the largest such problem (1 billion cells), inter-processor communication performance is not the bottleneck. Single-node efficiency is the dominant factor
Marek, A; Blum, V; Johanni, R; Havu, V; Lang, B; Auckenthaler, T; Heinecke, A; Bungartz, H-J; Lederer, H
2014-05-28
Obtaining the eigenvalues and eigenvectors of large matrices is a key problem in electronic structure theory and many other areas of computational science. The computational effort formally scales as O(N(3)) with the size of the investigated problem, N (e.g. the electron count in electronic structure theory), and thus often defines the system size limit that practical calculations cannot overcome. In many cases, more than just a small fraction of the possible eigenvalue/eigenvector pairs is needed, so that iterative solution strategies that focus only on a few eigenvalues become ineffective. Likewise, it is not always desirable or practical to circumvent the eigenvalue solution entirely. We here review some current developments regarding dense eigenvalue solvers and then focus on the Eigenvalue soLvers for Petascale Applications (ELPA) library, which facilitates the efficient algebraic solution of symmetric and Hermitian eigenvalue problems for dense matrices that have real-valued and complex-valued matrix entries, respectively, on parallel computer platforms. ELPA addresses standard as well as generalized eigenvalue problems, relying on the well documented matrix layout of the Scalable Linear Algebra PACKage (ScaLAPACK) library but replacing all actual parallel solution steps with subroutines of its own. For these steps, ELPA significantly outperforms the corresponding ScaLAPACK routines and proprietary libraries that implement the ScaLAPACK interface (e.g. Intel's MKL). The most time-critical step is the reduction of the matrix to tridiagonal form and the corresponding backtransformation of the eigenvectors. ELPA offers both a one-step tridiagonalization (successive Householder transformations) and a two-step transformation that is more efficient especially towards larger matrices and larger numbers of CPU cores. ELPA is based on the MPI standard, with an early hybrid MPI-OpenMPI implementation available as well. Scalability beyond 10,000 CPU cores for problem
Better than $1/Mflops substained: a scalable PC-based parallel computer for lattice QCD
International Nuclear Information System (INIS)
Fodor, Z.; Papp, G.
2002-02-01
We study the feasibility of a PC-based parallel computer for medium to large scale lattice QCD simulations. Our cluster built at the Eoetvoes Univ., Inst. Theor. Phys. consists of 137 Intel P4-1.7 GHz nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for staggered QCD, respectively (for 64-bit applications the performance is approximately halved). The novel feature of our system is its communication architecture. In order to have a scalable, cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware costs is spent on the communication). According to our benchmark measurements this type of communication results in around 40% communication time fraction for lattices upto 48 3 . 96 in full QCD simulations. The price/sustained-perfomance ratio for full QCD is better than $1/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice size, which can fit in our parallel computer. (orig.)
Reactive wavepacket dynamics for four atom systems on scalable parallel computers
International Nuclear Information System (INIS)
Goldfield, E.M.
1994-01-01
While time-dependent quantum mechanics has been successfully applied to many three atom systems, it was nevertheless a computational challenge to use wavepacket methods to study four atom systems, systems with several heavy atoms, and systems with deep potential wells. S.K. Gray and the author are studying the reaction of OH + CO ↔ (HOCO) ↔ H + CO 2 , a difficult reaction by all the above criteria. Memory considerations alone made it impossible to use a single IBM RS/6000 workstation to study a four degree-of-freedom model of this system. They have developed a scalable parallel wavepacket code for the IBM SP1 and have run it on the SP1 at Argonne and at the Cornell Theory Center. The wavepacket, defined on a four dimensional grid, is spread out among the processors. Two-dimensional FFT's are used to compute the kinetic energy operator acting on the wavepacket. Accomplishing this task, which is the computationally intensive part of the calculation, requires a global transpose of the data. This transpose is the only serious communication between processors. Since the problem is essentially data-parallel, communication is regular and load-balancing is excellent. But as the problem is moderately fine-grained and messages are long, the ratio of communication to computation is somewhat high and they typically get about 55% of ideal speed-up
Task Oriented Evaluation of Module Extraction Techniques
Palmisano, Ignazio; Tamma, Valentina; Payne, Terry; Doran, Paul
Ontology Modularization techniques identify coherent and often reusable regions within an ontology. The ability to identify such modules, thus potentially reducing the size or complexity of an ontology for a given task or set of concepts is increasingly important in the Semantic Web as domain ontologies increase in terms of size, complexity and expressivity. To date, many techniques have been developed, but evaluation of the results of these techniques is sketchy and somewhat ad hoc. Theoretical properties of modularization algorithms have only been studied in a small number of cases. This paper presents an empirical analysis of a number of modularization techniques, and the modules they identify over a number of diverse ontologies, by utilizing objective, task-oriented measures to evaluate the fitness of the modules for a number of statistical classification problems.
A scalable fully implicit framework for reservoir simulation on parallel computers
Yang, Haijian
2017-11-10
The modeling of multiphase fluid flow in porous medium is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, which both have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is taken into account to accelerate the convergence of the linear system, and thereby improves the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallel computers.
A scalable fully implicit framework for reservoir simulation on parallel computers
Yang, Haijian; Sun, Shuyu; Li, Yiteng; Yang, Chao
2017-01-01
The modeling of multiphase fluid flow in porous medium is of interest in the field of reservoir simulation. The promising numerical methods in the literature are mostly based on the explicit or semi-implicit approach, which both have certain stability restrictions on the time step size. In this work, we introduce and study a scalable fully implicit solver for the simulation of two-phase flow in a porous medium with capillarity, gravity and compressibility, which is free from the limitations of the conventional methods. In the fully implicit framework, a mixed finite element method is applied to discretize the model equations for the spatial terms, and the implicit Backward Euler scheme with adaptive time stepping is used for the temporal integration. The resultant nonlinear system arising at each time step is solved in a monolithic way by using a Newton–Krylov type method. The corresponding linear system from the Newton iteration is large sparse, nonsymmetric and ill-conditioned, consequently posing a significant challenge to the fully implicit solver. To address this issue, the family of additive Schwarz preconditioners is taken into account to accelerate the convergence of the linear system, and thereby improves the robustness of the outer Newton method. Several test cases in one, two and three dimensions are used to validate the correctness of the scheme and examine the performance of the newly developed algorithm on parallel computers.
Task-oriented training in rehabilitation after stroke : systematic review
Rensink, Marijke; Schuurmans, Marieke; Lindeman, Eline; Hafsteinsdottir, Thora
Task-oriented training in rehabilitation after stroke: systematic review. This paper is a report of a review conducted to provide an overview of the evidence in the literature on task-oriented training of stroke survivors and its relevance in daily nursing practice. Stroke is the second leading
Lyster, Peter M.; Guo, J.; Clune, T.; Larson, J. W.; Atlas, Robert (Technical Monitor)
2001-01-01
The computational complexity of algorithms for Four Dimensional Data Assimilation (4DDA) at NASA's Data Assimilation Office (DAO) is discussed. In 4DDA, observations are assimilated with the output of a dynamical model to generate best-estimates of the states of the system. It is thus a mapping problem, whereby scattered observations are converted into regular accurate maps of wind, temperature, moisture and other variables. The DAO is developing and using 4DDA algorithms that provide these datasets, or analyses, in support of Earth System Science research. Two large-scale algorithms are discussed. The first approach, the Goddard Earth Observing System Data Assimilation System (GEOS DAS), uses an atmospheric general circulation model (GCM) and an observation-space based analysis system, the Physical-space Statistical Analysis System (PSAS). GEOS DAS is very similar to global meteorological weather forecasting data assimilation systems, but is used at NASA for climate research. Systems of this size typically run at between 1 and 20 gigaflop/s. The second approach, the Kalman filter, uses a more consistent algorithm to determine the forecast error covariance matrix than does GEOS DAS. For atmospheric assimilation, the gridded dynamical fields typically have More than 10(exp 6) variables, therefore the full error covariance matrix may be in excess of a teraword. For the Kalman filter this problem can easily scale to petaflop/s proportions. We discuss the computational complexity of GEOS DAS and our implementation of the Kalman filter. We also discuss and quantify some of the technical issues and limitations in developing efficient, in terms of wall clock time, and scalable parallel implementations of the algorithms.
Byun, Hye Suk; El-Naggar, Mohamed Y.; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
2017-10-01
Kinetic Monte Carlo (KMC) simulations are used to study long-time dynamics of a wide variety of systems. Unfortunately, the conventional KMC algorithm is not scalable to larger systems, since its time scale is inversely proportional to the simulated system size. A promising approach to resolving this issue is the synchronous parallel KMC (SPKMC) algorithm, which makes the time scale size-independent. This paper introduces a formal derivation of the SPKMC algorithm based on local transition-state and time-dependent Hartree approximations, as well as its scalable parallel implementation based on a dual linked-list cell method. The resulting algorithm has achieved a weak-scaling parallel efficiency of 0.935 on 1024 Intel Xeon processors for simulating biological electron transfer dynamics in a 4.2 billion-heme system, as well as decent strong-scaling parallel efficiency. The parallel code has been used to simulate a lattice of cytochrome complexes on a bacterial-membrane nanowire, and it is broadly applicable to other problems such as computational synthesis of new materials.
International Nuclear Information System (INIS)
Egger, M.L.; Scheurer, A.H.; Joseph, C.
1996-01-01
The issue of long reconstruction times in PET has been addressed from several points of view, resulting in an affordable dedicated system capable of handling routine 3D reconstruction in a few minutes per frame: on the hardware side using fast processors and a parallel architecture, and on the software side, using efficient implementations of computationally less intensive algorithms. Execution times obtained for the PRT-1 data set on a parallel system of five hybrid nodes, each combining an Alpha processor for computation and a transputer for communication, are the following (256 sinograms of 96 views by 128 radial samples): Ramp algorithm 56 s, Favor 81 s and reprojection algorithm of Kinahan and Rogers 187 s. The implementation of fast rebinning algorithms has shown our hardware platform to become communications-limited; they execute faster on a conventional single-processor Alpha workstation: single-slice rebinning 7 s, Fourier rebinning 22 s, 2D filtered backprojection 5 s. The scalability of the system has been demonstrated, and a saturation effect at network sizes above ten nodes has become visible; new T9000-based products lifting most of the constraints on network topology and link throughput are expected to result in improved parallel efficiency and scalability properties
Murphy, Mark; Alley, Marcus; Demmel, James; Keutzer, Kurt; Vasanawala, Shreyas; Lustig, Michael
2012-01-01
We present ℓ1-SPIRiT, a simple algorithm for auto calibrating parallel imaging (acPI) and compressed sensing (CS) that permits an efficient implementation with clinically-feasible runtimes. We propose a CS objective function that minimizes cross-channel joint sparsity in the Wavelet domain. Our reconstruction minimizes this objective via iterative soft-thresholding, and integrates naturally with iterative Self-Consistent Parallel Imaging (SPIRiT). Like many iterative MRI reconstructions, ℓ1-SPIRiT’s image quality comes at a high computational cost. Excessively long runtimes are a barrier to the clinical use of any reconstruction approach, and thus we discuss our approach to efficiently parallelizing ℓ1-SPIRiT and to achieving clinically-feasible runtimes. We present parallelizations of ℓ1-SPIRiT for both multi-GPU systems and multi-core CPUs, and discuss the software optimization and parallelization decisions made in our implementation. The performance of these alternatives depends on the processor architecture, the size of the image matrix, and the number of parallel imaging channels. Fundamentally, achieving fast runtime requires the correct trade-off between cache usage and parallelization overheads. We demonstrate image quality via a case from our clinical experimentation, using a custom 3DFT Spoiled Gradient Echo (SPGR) sequence with up to 8× acceleration via poisson-disc undersampling in the two phase-encoded directions. PMID:22345529
Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter
2015-01-20
While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.
Liu, Yang; Yucel, Abdulkadir; Bagci, Hakan; Michielssen, Eric
2015-01-01
of processors by leveraging two mechanisms: (i) a hierarchical parallelization strategy to evenly distribute the computation and memory loads at all levels of the PWTD tree among processors, and (ii) a novel asynchronous communication scheme to reduce the cost
PetClaw: A scalable parallel nonlinear wave propagation solver for Python
Alghamdi, Amal; Ahmadia, Aron; Ketcheson, David I.; Knepley, Matthew; Mandli, Kyle; Dalcin, Lisandro
2011-01-01
We present PetClaw, a scalable distributed-memory solver for time-dependent nonlinear wave propagation. PetClaw unifies two well-known scientific computing packages, Clawpack and PETSc, using Python interfaces into both. We rely on Clawpack to provide the infrastructure and kernels for time-dependent nonlinear wave propagation. Similarly, we rely on PETSc to manage distributed data arrays and the communication between them.We describe both the implementation and performance of PetClaw as well as our challenges and accomplishments in scaling a Python-based code to tens of thousands of cores on the BlueGene/P architecture. The capabilities of PetClaw are demonstrated through application to a novel problem involving elastic waves in a heterogeneous medium. Very finely resolved simulations are used to demonstrate the suppression of shock formation in this system.
Towards scalable parallelism in Monte Carlo particle transport codes using remote memory access
International Nuclear Information System (INIS)
Romano, Paul K.; Forget, Benoit; Brown, Forrest
2010-01-01
One forthcoming challenge in the area of high-performance computing is having the ability to run large-scale problems while coping with less memory per compute node. In this work, we investigate a novel data decomposition method that would allow Monte Carlo transport calculations to be performed on systems with limited memory per compute node. In this method, each compute node remotely retrieves a small set of geometry and cross-section data as needed and remotely accumulates local tallies when crossing the boundary of the local spatial domain. Initial results demonstrate that while the method does allow large problems to be run in a memory-limited environment, achieving scalability may be difficult due to inefficiencies in the current implementation of RMA operations. (author)
Energy Technology Data Exchange (ETDEWEB)
Barbara Chapman
2012-02-01
OpenMP was not well recognized at the beginning of the project, around year 2003, because of its limited use in DoE production applications and the inmature hardware support for an efficient implementation. Yet in the recent years, it has been graduately adopted both in HPC applications, mostly in the form of MPI+OpenMP hybrid code, and in mid-scale desktop applications for scientific and experimental studies. We have observed this trend and worked deligiently to improve our OpenMP compiler and runtimes, as well as to work with the OpenMP standard organization to make sure OpenMP are evolved in the direction close to DoE missions. In the Center for Programming Models for Scalable Parallel Computing project, the HPCTools team at the University of Houston (UH), directed by Dr. Barbara Chapman, has been working with project partners, external collaborators and hardware vendors to increase the scalability and applicability of OpenMP for multi-core (and future manycore) platforms and for distributed memory systems by exploring different programming models, language extensions, compiler optimizations, as well as runtime library support.
Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2
Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad
1995-01-01
The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 M ops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32-nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32 node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.
Task-Oriented Gaming for Transfer to Prosthesis Use
van Dijk, Ludger; Sluis, van der Corry K.; van Dijk, Hylke W.; Bongers, Raoul M.
2016-01-01
The aim of this study is to establish the effect of task-oriented video gaming on using a myoelectric prosthesis in a basic activity of daily life (ADL). Forty-one able-bodied right-handed participants were randomly assigned to one of four groups. In three of these groups the participants trained to
ACME: A scalable parallel system for extracting frequent patterns from a very long sequence
Sahli, Majed; Mansour, Essam; Kalnis, Panos
2014-01-01
-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems, or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency
Massively Parallel and Scalable Implicit Time Integration Algorithms for Structural Dynamics
Farhat, Charbel
1997-01-01
Explicit codes are often used to simulate the nonlinear dynamics of large-scale structural systems, even for low frequency response, because the storage and CPU requirements entailed by the repeated factorizations traditionally found in implicit codes rapidly overwhelm the available computing resources. With the advent of parallel processing, this trend is accelerating because of the following additional facts: (a) explicit schemes are easier to parallelize than implicit ones, and (b) explicit schemes induce short range interprocessor communications that are relatively inexpensive, while the factorization methods used in most implicit schemes induce long range interprocessor communications that often ruin the sought-after speed-up. However, the time step restriction imposed by the Courant stability condition on all explicit schemes cannot yet be offset by the speed of the currently available parallel hardware. Therefore, it is essential to develop efficient alternatives to direct methods that are also amenable to massively parallel processing because implicit codes using unconditionally stable time-integration algorithms are computationally more efficient when simulating the low-frequency dynamics of aerospace structures.
Jiang, Hayang; Xie, Gaogang; Salamatian, Kavé; Mathy, Laurent
2013-01-01
Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. Both hardware accelerated and parallel software-based NIDS solutions, based on commodity multi-core and GPU processors, have been proposed to overcome these challenges. Network Intrusion Detection Systems (NIDSes) face significant challenges coming from the relentless network link speed growth and increasing complexity of threats. ...
Race: A scalable and elastic parallel system for discovering repeats in very long sequences
Mansour, Essam
2013-08-26
A wide range of applications, including bioinformatics, time series, and log analysis, depend on the identification of repetitions in very long sequences. The problem of finding maximal pairs subsumes most important types of repetition-finding tasks. Existing solutions require both the input sequence and its index (typically an order of magnitude larger than the input) to fit in memory. Moreover, they are serial algorithms with long execution time. Therefore, they are limited to small datasets, despite the fact that modern applications demand orders of magnitude longer sequences. In this paper we present RACE, a parallel system for finding maximal pairs in very long sequences. RACE supports parallel execution on stand-alone multicore systems, in addition to scaling to thousands of nodes on clusters or supercomputers. RACE does not require the input or the index to fit in memory; therefore, it supports very long sequences with limited memory. Moreover, it uses a novel array representation that allows for cache-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we propose a cost model to estimate the expected speedup, based on statistics gathered through sampling. The model allows the user to select the appropriate combination of cloud resources based on the provider\\'s prices and the required deadline. We conducted extensive experimental evaluation with large real datasets and large computing infrastructures. In contrast to existing methods, RACE can handle the entire human genome on a typical desktop computer with 16GB RAM. Moreover, for a problem that takes 10 hours of serial execution, RACE finishes in 28 seconds using 2,048 nodes on an IBM BlueGene/P supercomputer.
ACME: A scalable parallel system for extracting frequent patterns from a very long sequence
Sahli, Majed
2014-10-02
Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow, therefore, applicable only to very small sequences (i.e., in the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is the first to support supermaximal motifs. ACME is a versatile parallel system that can be deployed on desktop multi-core systems, or on thousands of CPUs in the cloud. However, merely using more compute nodes does not guarantee efficiency, because of the related overheads. To this end, ACME introduces an automatic tuning mechanism that suggests the appropriate number of CPUs to utilize, in order to meet the user constraints in terms of run time, while minimizing the financial cost of cloud resources. Our experiments show that, compared to the state of the art, ACME supports three orders of magnitude longer sequences (e.g., DNA for the entire human genome); handles large alphabets (e.g., English alphabet for Wikipedia); scales out to 16,384 CPUs on a supercomputer; and supports elastic deployment in the cloud.
Zinno, Ivana; De Luca, Claudio; Elefante, Stefano; Imperatore, Pasquale; Manunta, Michele; Casu, Francesco
2014-05-01
Differential Synthetic Aperture Radar Interferometry (DInSAR) is an effective technique to estimate and monitor ground displacements with centimetre accuracy [1]. In the last decade, advanced DInSAR algorithms, such as the Small Baseline Subset (SBAS) [2] one that is aimed at following the temporal evolution of the ground deformation, showed to be significantly useful remote sensing tools for the geoscience communities as well as for those related to hazard monitoring and risk mitigation. DInSAR scenario is currently characterized by the large and steady increasing availability of huge SAR data archives that have a broad range of diversified features according to the characteristics of the employed sensor. Indeed, besides the old generation sensors, that include ERS, ENVISAT and RADARSAT systems, the new X-band generation constellations, such as COSMO-SkyMed and TerraSAR-X, have permitted an overall study of ground deformations with an unprecedented detail thanks to their improved spatial resolution and reduced revisit time. Furthermore, the incoming ESA Sentinel-1 SAR satellite is characterized by a global coverage acquisition strategy and 12-day revisit time and, therefore, will further contribute to improve deformation analyses and monitoring capabilities. However, in this context, the capability to process such huge SAR data archives is strongly limited by the existing DInSAR algorithms, which are not specifically designed to exploit modern high performance computational infrastructures (e.g. cluster, grid and cloud computing platforms). The goal of this paper is to present a Parallel version of the SBAS algorithm (P-SBAS) which is based on a dual-level parallelization approach and embraces combined parallel strategies [3], [4]. A detailed description of the P-SBAS algorithm will be provided together with a scalability analysis focused on studying its performances. In particular, a P-SBAS scalability analysis with respect to the number of exploited CPUs has
Geisler, David J; Fontaine, Nicolas K; Scott, Ryan P; He, Tingting; Paraschis, Loukas; Gerstel, Ori; Heritage, Jonathan P; Yoo, S J B
2011-04-25
We demonstrate an optical transmitter based on dynamic optical arbitrary waveform generation (OAWG) which is capable of creating high-bandwidth (THz) data waveforms in any modulation format using the parallel synthesis of multiple coherent spectral slices. As an initial demonstration, the transmitter uses only 5.5 GHz of electrical bandwidth and two 10-GHz-wide spectral slices to create 100-ns duration, 20-GHz optical waveforms in various modulation formats including differential phase-shift keying (DPSK), quaternary phase-shift keying (QPSK), and eight phase-shift keying (8PSK) with only changes in software. The experimentally generated waveforms showed clear eye openings and separated constellation points when measured using a real-time digital coherent receiver. Bit-error-rate (BER) performance analysis resulted in a BER < 9.8 × 10(-6) for DPSK and QPSK waveforms. Additionally, we experimentally demonstrate three-slice, 4-ns long waveforms that highlight the bandwidth scalable nature of the optical transmitter. The various generated waveforms show that the key transmitter properties (i.e., packet length, modulation format, data rate, and modulation filter shape) are software definable, and that the optical transmitter is capable of acting as a flexible bandwidth transmitter.
a Task-Oriented Disaster Information Correlation Method
Linyao, Q.; Zhiqiang, D.; Qing, Z.
2015-07-01
With the rapid development of sensor networks and Earth observation technology, a large quantity of disaster-related data is available, such as remotely sensed data, historic data, case data, simulated data, and disaster products. However, the efficiency of current data management and service systems has become increasingly difficult due to the task variety and heterogeneous data. For emergency task-oriented applications, the data searches primarily rely on artificial experience based on simple metadata indices, the high time consumption and low accuracy of which cannot satisfy the speed and veracity requirements for disaster products. In this paper, a task-oriented correlation method is proposed for efficient disaster data management and intelligent service with the objectives of 1) putting forward disaster task ontology and data ontology to unify the different semantics of multi-source information, 2) identifying the semantic mapping from emergency tasks to multiple data sources on the basis of uniform description in 1), and 3) linking task-related data automatically and calculating the correlation between each data set and a certain task. The method goes beyond traditional static management of disaster data and establishes a basis for intelligent retrieval and active dissemination of disaster information. The case study presented in this paper illustrates the use of the method on an example flood emergency relief task.
Task-oriented lossy compression of magnetic resonance images
Anderson, Mark C.; Atkins, M. Stella; Vaisey, Jacques
1996-04-01
A new task-oriented image quality metric is used to quantify the effects of distortion introduced into magnetic resonance images by lossy compression. This metric measures the similarity between a radiologist's manual segmentation of pathological features in the original images and the automated segmentations performed on the original and compressed images. The images are compressed using a general wavelet-based lossy image compression technique, embedded zerotree coding, and segmented using a three-dimensional stochastic model-based tissue segmentation algorithm. The performance of the compression system is then enhanced by compressing different regions of the image volume at different bit rates, guided by prior knowledge about the location of important anatomical regions in the image. Application of the new system to magnetic resonance images is shown to produce compression results superior to the conventional methods, both subjectively and with respect to the segmentation similarity metric.
Investigation of Ego and Task Orientation among International Wrestling Referees
Directory of Open Access Journals (Sweden)
I. Barbas
2016-12-01
Full Text Available Aim: study was to investigate any possible effect(s of experiences from active membership and participation in task or ego orientations among referees in the sport of wrestling. Material: The sample consisted of 213 international referees from 30 countries (Greece, Turkey, Bulgaria, France, Italy, Germany, Sweden, Finland, Switzerland, Russia, Poland, Hungary, U.S.A, Ukraine, Armenia, Azerbaijan, Iran, Japan, Korea, Mongolia, Kazakhstan, Egypt, Canada, Georgia, Croatia, Uzbekistan, Norway, Cuba, Belarus, & Tunisia. Their age ranged from 26 to 60 yrs. old ( M =43, SD =8.6. During the procedure, the participants were asked to fill a specific questionnaire, the «Task and Ego Orientation in Sport Questionnaire» (Duda & Nicholls, 1992. Results: Results showed that the referees from elite wrestling level’ countries (Russia, Azerbaijan, Iran, Turkey, Georgia, Armenia, Bulgaria, Ukraine, U.S.A., Korea, Japan, Kazakhstan, & Cuba are more task oriented than those from the non-elite wrestling level’ countries. Researchers believe that this occurred because referees from non-elite wrestling level’ countries might have less game-sport experience and more specifically in high level games. At the same time, the Olympic experience referees were more task oriented than the non-Olympic experienced. Conclusion: Referee’s decisions are an important issue in the sport milieu. The investigations in decision-making by referees and factors that affect it are rather scarce and research should focus on such topics. Improvement of decision-making by referees, would lead to safer and better performance. Thus, better understanding of referees’ behavior, through identification and operationalization of the factors affecting it, might lead to more effective selection, training and performance.
Task-Oriented and Relationship-Building Communications between Air Traffic Controllers and Pilots
Directory of Open Access Journals (Sweden)
Inwon Kang
2017-09-01
Full Text Available By questioning the lopsided attention on task-oriented factors in air traffic controller-pilot communication, the current study places an equal weighting on both task-oriented and relationship-building communications, and investigates how each type of communication influences sustainable performance in airline operation team. Results show that both task-oriented and relationship-building communications in terms of sustainability of team process predicted greater communication satisfaction at work. Also, both task interdependence and shared leadership influenced both types of air traffic controller-pilot communication. However, only relationship-building communication had a direct influence on perceived work performance whereas task-oriented communication had not. Along with task-oriented factors, this study raises the relationship-oriented factors as important resources for the sustainable team performance in airline industry.
International Nuclear Information System (INIS)
Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; Vanberg, R.
1990-01-01
A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a proposed new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of Gigabytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the proposed Scalable Parallel Open Architecture data acquisition system are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build a prototype of the proposed data acquisition system architecture is given in the paper. The major component of the system, a self-routing parallel event builder, is described in detail
Sung, Chul
2013-08-01
Accurate estimation of neuronal count and distribution is central to the understanding of the organization and layout of cortical maps in the brain, and changes in the cell population induced by brain disorders. High-throughput 3D microscopy techniques such as Knife-Edge Scanning Microscopy (KESM) are enabling whole-brain survey of neuronal distributions. Data from such techniques pose serious challenges to quantitative analysis due to the massive, growing, and sparsely labeled nature of the data. In this paper, we present a scalable, incremental learning algorithm for cell body detection that can address these issues. Our algorithm is computationally efficient (linear mapping, non-iterative) and does not require retraining (unlike gradient-based approaches) or retention of old raw data (unlike instance-based learning). We tested our algorithm on our rat brain Nissl data set, showing superior performance compared to an artificial neural network-based benchmark, and also demonstrated robust performance in a scenario where the data set is rapidly growing in size. Our algorithm is also highly parallelizable due to its incremental nature, and we demonstrated this empirically using a MapReduce-based implementation of the algorithm. We expect our scalable, incremental learning approach to be widely applicable to medical imaging domains where there is a constant flux of new data. © 2013 IEEE.
Yusa, Yasunori; Okada, Hiroshi; Yamada, Tomonori; Yoshimura, Shinobu
2018-04-01
A domain decomposition method for large-scale elastic-plastic problems is proposed. The proposed method is based on a quasi-Newton method in conjunction with a balancing domain decomposition preconditioner. The use of a quasi-Newton method overcomes two problems associated with the conventional domain decomposition method based on the Newton-Raphson method: (1) avoidance of a double-loop iteration algorithm, which generally has large computational complexity, and (2) consideration of the local concentration of nonlinear deformation, which is observed in elastic-plastic problems with stress concentration. Moreover, the application of a balancing domain decomposition preconditioner ensures scalability. Using the conventional and proposed domain decomposition methods, several numerical tests, including weak scaling tests, were performed. The convergence performance of the proposed method is comparable to that of the conventional method. In particular, in elastic-plastic analysis, the proposed method exhibits better convergence performance than the conventional method.
Niculescu, A.I.
2011-01-01
This dissertation focuses on the design and evaluation of speech-based conversational interfaces for task-oriented dialogues. Conversational interfaces are software programs enabling interaction with computer devices through natural language dialogue. Even though processing conversational speech is
Task oriented training improves the balance outcome & reducing fall risk in diabetic population
Ghazal, Javeria; Malik, Arshad Nawaz; Amjad, Imran
2016-01-01
Objectives: The objective was to determine the balance impairments and to compare task oriented versus traditional balance training in fall reduction among diabetic patients. Methods: The randomized control trial with descriptive survey and 196 diabetic patients were recruited to assess balance impairments through purposive sampling technique. Eighteen patients were randomly allocated into two groups; task oriented balance training group TOB (n=8) and traditional balance training group TBT (n...
Energy Technology Data Exchange (ETDEWEB)
Pebay, Philippe [Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Terriberry, Timothy B. [Xiph.Org Foundation, Arlington, VA (United States); Kolla, Hemanth [Sandia National Laboratories (SNL-CA), Livermore, CA (United States); Bennett, Janine [Sandia National Laboratories (SNL-CA), Livermore, CA (United States)
2016-03-29
Formulas for incremental or parallel computation of second order central moments have long been known, and recent extensions of these formulas to univariate and multivariate moments of arbitrary order have been developed. Formulas such as these, are of key importance in scenarios where incremental results are required and in parallel and distributed systems where communication costs are high. We survey these recent results, and improve them with arbitrary-order, numerically stable one-pass formulas which we further extend with weighted and compound variants. We also develop a generalized correction factor for standard two-pass algorithms that enables the maintenance of accuracy over nearly the full representable range of the input, avoiding the need for extended-precision arithmetic. We then empirically examine algorithm correctness for pairwise update formulas up to order four as well as condition number and relative error bounds for eight different central moment formulas, each up to degree six, to address the trade-offs between numerical accuracy and speed of the various algorithms. Finally, we demonstrate the use of the most elaborate among the above mentioned formulas, with the utilization of the compound moments for a practical large-scale scientific application.
Energy Technology Data Exchange (ETDEWEB)
Albonesi, David; Burtscher, Martin
2009-04-17
The goal of this project has been to develop a lossless compression algorithm for message-passing libraries that can accelerate HPC systems by reducing the communication time. Because both compression and decompression have to be performed in software in real time, the algorithm has to be extremely fast while still delivering a good compression ratio. During the first half of this project, they designed a new compression algorithm called FPC for scientific double-precision data, made the source code available on the web, and published two papers describing its operation, the first in the proceedings of the Data Compression Conference and the second in the IEEE Transactions on Computers. At comparable average compression ratios, this algorithm compresses and decompresses 10 to 100 times faster than BZIP2, DFCM, FSD, GZIP, and PLMI on the three architectures tested. With prediction tables that fit into the CPU's L1 data acache, FPC delivers a guaranteed throughput of six gigabits per second on a 1.6 GHz Itanium 2 system. The C source code and documentation of FPC are posted on-line and have already been downloaded hundreds of times. To evaluate FPC, they gathered 13 real-world scientific datasets from around the globe, including satellite data, crash-simulation data, and messages from HPC systems. Based on the large number of requests they received, they also made these datasets available to the community (with permission of the original sources). While FPC represents a great step forward, it soon became clear that its throughput was too slow for the emerging 10 gigabits per second networks. Hence, no speedup can be gained by including this algorithm in an MPI library. They therefore changed the aim of the second half of the project. Instead of implementing FPC in an MPI library, they refocused their efforts to develop a parallel compression algorithm to further boost the throughput. After all, all modern high-end microprocessors contain multiple CPUs on a
International Nuclear Information System (INIS)
Samatova, N F; Schmidt, M C; Hendrix, W; Breimyer, P; Thomas, K; Park, B-H
2008-01-01
Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often considered a combinatorial graph problem where an enumeration of all feasible models is sought. The data-intensive and the NP-hard nature of such problems, however, challenges existing methods to meet the required scale of data size and uncertainty, even on modern supercomputers. Maximal clique enumeration (MCE) in a graph derived from such biological data is often a rate-limiting step in detecting protein complexes in protein interaction data, finding clusters of co-expressed genes in microarray data, or identifying clusters of orthologous genes in protein sequence data. We report two key advances that address this challenge. We designed and implemented the first (to the best of our knowledge) parallel MCE algorithm that scales linearly on thousands of processors running MCE on real-world biological networks with thousands and hundreds of thousands of vertices. In addition, we proposed and developed the Graph Perturbation Theory (GPT) that establishes a foundation for efficiently solving the MCE problem in perturbed graphs, which model the uncertainty in the data. GPT formulates necessary and sufficient conditions for detecting the differences between the sets of maximal cliques in the original and perturbed graphs and reduces the enumeration time by more than 80% compared to complete recomputation
Relations- and task-oriented behaviour of school leaders: Cases from primary schools in Finland
Directory of Open Access Journals (Sweden)
Mani Man Singh Rajbhandari
2016-08-01
principals, teachers, special-education teachers, and administrative staff members. The results suggest that leadership behavioural styles in terms of relations-oriented and task-oriented behaviour are equally important for accommodating changes and development in schools. The results suggest that relations-oriented behaviour was preferred by those who had been in the organisation for a longer time. The task-oriented behavioural style was found to be adopted when changes were required by the municipality (school district, which needed to be urgently addressed to meet the current requirements for school infrastructural development and changes in the educational system. In addition, the school leaders with task-oriented behaviour were more effective, while leaders with relations-oriented behaviour were efficient and generated social harmony. These findings suggest that contextual variations enabled flexibility in leadership behavioural style.
Srikesavan, Cynthia Swarnalatha; Shay, Barbara; Robinson, David B; Szturm, Tony
2013-03-09
Significant restriction in the ability to participate in home, work and community life results from pain, fatigue, joint damage, stiffness and reduced joint range of motion and muscle strength in people with rheumatoid arthritis or osteoarthritis of the hand. With modest evidence on the therapeutic effectiveness of conventional hand exercises, a task-oriented training program via real life object manipulations has been developed for people with arthritis. An innovative, computer-based gaming platform that allows a broad range of common objects to be seamlessly transformed into therapeutic input devices through instrumentation with a motion-sense mouse has also been designed. Personalized objects are selected to target specific training goals such as graded finger mobility, strength, endurance or fine/gross dexterous functions. The movements and object manipulation tasks that replicate common situations in everyday living will then be used to control and play any computer game, making practice challenging and engaging. The ongoing study is a 6-week, single-center, parallel-group, equally allocated and assessor-blinded pilot randomized controlled trial. Thirty people with rheumatoid arthritis or osteoarthritis affecting the hand will be randomized to receive either conventional hand exercises or the task-oriented training. The purpose is to determine a preliminary estimation of therapeutic effectiveness and feasibility of the task-oriented training program. Performance based and self-reported hand function, and exercise compliance are the study outcomes. Changes in outcomes (pre to post intervention) within each group will be assessed by paired Student t test or Wilcoxon signed-rank test and between groups (control versus experimental) post intervention using unpaired Student t test or Mann-Whitney U test. The study findings will inform decisions on the feasibility, safety and completion rate and will also provide preliminary data on the treatment effects of the task-oriented
Interaction between Task Oriented and Affective Information Processing in Cognitive Robotics
Haazebroek, Pascal; van Dantzig, Saskia; Hommel, Bernhard
There is an increasing interest in endowing robots with emotions. Robot control however is still often very task oriented. We present a cognitive architecture that allows the combination of and interaction between task representations and affective information processing. Our model is validated by comparing simulation results with empirical data from experimental psychology.
Small, Task-Oriented Groups: Conflict, Conflict Management, Satisfaction, and Decision Quality.
Wall, Victor D., Jr.; And Others
1987-01-01
Examined relationship among amount of conflict experienced, the style of its management, individual satisfaction, and decision quality of small, task-oriented groups using 129 college student subjects in 24 groups. Data suggest a curvilinear relationship between the number of conflict episodes experienced by group members and the subsequent…
Task-oriented training in rehabilitation after stroke : a systematic review
Drs. Marijke Rensink; Eline Eline Lindeman; Marieke Schuurmans; Thóra Hafsteinsdóttir
2008-01-01
This paper is a report of a review conducted to provide an overview of the evidence in the literature on task-oriented training of stroke survivors and its relevance in daily nursing practice. Background: Stroke is the second leading cause of death and one of the leading causes of adult disability
Task oriented training improves the balance outcome & reducing fall risk in diabetic population.
Ghazal, Javeria; Malik, Arshad Nawaz; Amjad, Imran
2016-01-01
The objective was to determine the balance impairments and to compare task oriented versus traditional balance training in fall reduction among diabetic patients. The randomized control trial with descriptive survey and 196 diabetic patients were recruited to assess balance impairments through purposive sampling technique. Eighteen patients were randomly allocated into two groups; task oriented balance training group TOB (n=8) and traditional balance training group TBT (n=10). The inclusion criteria were 30-50 years age bracket and diagnosed cases of Diabetes Mellitus with neuropathy. The demographics were taken through standardized & valid assessment tools include Berg Balance Scale and Functional Reach Test. The measurements were obtained at baseline, after 04 and 08 weeks of training. The mean age of the participants was 49 ±6.79. The result shows that 165(84%) were at moderate risk of fall and 31(15%) were at mild risk of fall among total 196 diabetic patients. There was significant improvement (p balance training group for dynamic balance, anticipatory balance and reactive balance after 8 weeks of training as compare to traditional balance training. Task oriented balance training is effective in improving the dynamic, anticipator and reactive balance. The task oriented training reduces the risk of falling through enhancing balance outcome.
Task-Oriented Spoken Dialog System for Second-Language Learning
Kwon, Oh-Woog; Kim, Young-Kil; Lee, Yunkeun
2016-01-01
This paper introduces a Dialog-Based Computer Assisted second-Language Learning (DB-CALL) system using task-oriented dialogue processing technology. The system promotes dialogue with a second-language learner for a specific task, such as purchasing tour tickets, ordering food, passing through immigration, etc. The dialog system plays a role of a…
An Efficient Framework for Development of Task-Oriented Dialog Systems in a Smart Home Environment
Directory of Open Access Journals (Sweden)
Youngmin Park
2018-05-01
Full Text Available In recent times, with the increasing interest in conversational agents for smart homes, task-oriented dialog systems are being actively researched. However, most of these studies are focused on the individual modules of such a system, and there is an evident lack of research on a dialog framework that can integrate and manage the entire dialog system. Therefore, in this study, we propose a framework that enables the user to effectively develop an intelligent dialog system. The proposed framework ontologically expresses the knowledge required for the task-oriented dialog system’s process and can build a dialog system by editing the dialog knowledge. In addition, the framework provides a module router that can indirectly run externally developed modules. Further, it enables a more intelligent conversation by providing a hierarchical argument structure (HAS to manage the various argument representations included in natural language sentences. To verify the practicality of the framework, an experiment was conducted in which developers without any previous experience in developing a dialog system developed task-oriented dialog systems using the proposed framework. The experimental results show that even beginner dialog system developers can develop a high-level task-oriented dialog system.
An Efficient Framework for Development of Task-Oriented Dialog Systems in a Smart Home Environment.
Park, Youngmin; Kang, Sangwoo; Seo, Jungyun
2018-05-16
In recent times, with the increasing interest in conversational agents for smart homes, task-oriented dialog systems are being actively researched. However, most of these studies are focused on the individual modules of such a system, and there is an evident lack of research on a dialog framework that can integrate and manage the entire dialog system. Therefore, in this study, we propose a framework that enables the user to effectively develop an intelligent dialog system. The proposed framework ontologically expresses the knowledge required for the task-oriented dialog system's process and can build a dialog system by editing the dialog knowledge. In addition, the framework provides a module router that can indirectly run externally developed modules. Further, it enables a more intelligent conversation by providing a hierarchical argument structure (HAS) to manage the various argument representations included in natural language sentences. To verify the practicality of the framework, an experiment was conducted in which developers without any previous experience in developing a dialog system developed task-oriented dialog systems using the proposed framework. The experimental results show that even beginner dialog system developers can develop a high-level task-oriented dialog system.
Task-oriented control of Single-Master Multi-Slave Manipulator System
International Nuclear Information System (INIS)
Kosuge, Kazuhiro; Ishikawa, Jun; Furuta, Katsuhisa; Hariki, Kazuo; Sakai, Masaru.
1994-01-01
A master-slave manipulator system, in general, consists of a master arm manipulated by a human and a slave arm used for real tasks. Some tasks, such as manipulation of a heavy object, etc., require two or more slave arms operated simultaneously. A Single-Master Multi-Slave Manipulator System consists of a master arm with six degrees of freedom and two or more slave arms, each of which has six or more degrees of freedom. In this system, a master arm controls the task-oriented variables using Virtual Internal Model (VIM) based on the concept of 'Task-Oriented Control'. VIM is a reference model driven by sensory information and used to describe the desired relation between the motion of a master arm and task-oriented variables. The motion of slave arms are controlled based on the task oriented variables generated by VIM and tailors the system to meet specific tasks. A single-master multi-slave manipulator system, having two slave arms, is experimentally developed and illustrates the concept. (author)
Teachers' Use of Potentially Reinforcing Behaviors and Students' Task-Oriented Behavior.
Anderson, Lorin; And Others
The present study focuses on two major questions. First, how often are potentially reinforcing behaviors emitted by teachers in naturally occurring classrooms? Second, what is the relationship between the display of potentially reinforcing behaviors by the teacher and the task-orientation of randomly selected students in the classrooms. Students…
Wevers, Lotte; van de Port, Ingrid; Vermue, Mathijs; Mead, Gillian; Kwakkel, Gert
Background and Purpose-There is increasing interest in the potential benefits of circuit class training after stroke, but its effectiveness is uncertain. Our aim was to systematically review randomized, controlled trials of task-oriented circuit class training on gait and gait-related activities in
International Nuclear Information System (INIS)
Alnaes, K.; Kristiansen, E.H.; Gustavson, D.B.; James, D.V.
1990-01-01
The Scalable Coherent Interface (IEEE P1596) is establishing an interface standard for very high performance multiprocessors, supporting a cache-coherent-memory model scalable to systems with up to 64K nodes. This Scalable Coherent Interface (SCI) will supply a peak bandwidth per node of 1 GigaByte/second. The SCI standard should facilitate assembly of processor, memory, I/O and bus bridge cards from multiple vendors into massively parallel systems with throughput far above what is possible today. The SCI standard encompasses two levels of interface, a physical level and a logical level. The physical level specifies electrical, mechanical and thermal characteristics of connectors and cards that meet the standard. The logical level describes the address space, data transfer protocols, cache coherence mechanisms, synchronization primitives and error recovery. In this paper we address logical level issues such as packet formats, packet transmission, transaction handshake, flow control, and cache coherence. 11 refs., 10 figs
Krü ger, Jens J.; Hadwiger, Markus
2014-01-01
In computer science in general and in particular the field of high performance computing and supercomputing the term scalable plays an important role. It indicates that a piece of hardware, a concept, an algorithm, or an entire system scales
International Nuclear Information System (INIS)
Chang, Pyung-Hun; Park, Joon-Young
2002-01-01
This paper presents a Task Oriented Design method for robot kinematics based on grid method, widely used in finite difference method and heat transfer/fluid flow analyses. This approach drastically reduces complexities and computational burden due to previous approaches. More specifically, the grid method with a new formulation simplifies the design to a problem of three-design-variable unit grid, which does not require to solve inverse/forward kinematics. The effectiveness of the grid method has been confirmed through a kinematics design of a robot for nuclear power plants. (author)
Discourse pragmatics and ellipsis resolution in task-oriented natural language interfaces
Energy Technology Data Exchange (ETDEWEB)
Carbonell, J.G.
1983-01-01
The author reviews discourse phenomena that occur frequently in task-oriented man-machine dialogs, reporting on an empirical study that demonstrates the necessity of handling ellipsis, anaphora, extragrammaticality, intersentential metalanguage, and other abbreviatory devices in order to achieve convivial user interaction. Invariably, users prefer to generate terse or fragmentary utterances instead of longer, more complete stand-alone expressions, even when given clear instructions to the contrary. The xcalibur expert system interface is designed to meet these needs, including generalized ellipsis resolution by means of a rule-based caseframe method superior to previous semantic grammar approaches. 23 references.
Krüger, Jens J.
2014-01-01
In computer science in general and in particular the field of high performance computing and supercomputing the term scalable plays an important role. It indicates that a piece of hardware, a concept, an algorithm, or an entire system scales with the size of the problem, i.e., it can not only be used in a very specific setting but it\\'s applicable for a wide range of problems. From small scenarios to possibly very large settings. In this spirit, there exist a number of fixed areas of research on scalability. There are works on scalable algorithms, scalable architectures but what are scalable devices? In the context of this chapter, we are interested in a whole range of display devices, ranging from small scale hardware such as tablet computers, pads, smart-phones etc. up to large tiled display walls. What interests us mostly is not so much the hardware setup but mostly the visualization algorithms behind these display systems that scale from your average smart phone up to the largest gigapixel display walls.
Lee, Hyung Young; Kim, You Lim; Lee, Suk Min
2015-06-01
[Purpose] This study aimed to investigate the clinical effects of virtual reality-based training and task-oriented training on balance performance in stroke patients. [Subjects and Methods] The subjects were randomly allocated to 2 groups: virtual reality-based training group (n = 12) and task-oriented training group (n = 12). The patients in the virtual reality-based training group used the Nintendo Wii Fit Plus, which provided visual and auditory feedback as well as the movements that enabled shifting of weight to the right and left sides, for 30 min/day, 3 times/week for 6 weeks. The patients in the task-oriented training group practiced additional task-oriented programs for 30 min/day, 3 times/week for 6 weeks. Patients in both groups also underwent conventional physical therapy for 60 min/day, 5 times/week for 6 weeks. [Results] Balance and functional reach test outcomes were examined in both groups. The results showed that the static balance and functional reach test outcomes were significantly higher in the virtual reality-based training group than in the task-oriented training group. [Conclusion] This study suggested that virtual reality-based training might be a more feasible and suitable therapeutic intervention for dynamic balance in stroke patients compared to task-oriented training.
The Application of Task- oriented Teaching Approach to Enhancing Communicative Competence of EFL
Directory of Open Access Journals (Sweden)
LI Mingxin
2014-02-01
Full Text Available To communicate is the primary goal of most foreign language learning (EFL. As an important component of the four macro skills (listening, speaking, reading and writing, reading should also serve this purpose. However, traditional methodology still dominates extensive reading teaching in most of the universities. To promote a communicative extensive reading class, we may start by designing various tasks and activities. This paper introduces a task-oriented approach in English extensive reading class. According to Nunan, task-oriented teaching involves learners in the classroom to comprehend, manipulate, produce or interact in the target language, but the focus is on the meaning rather than the form. In light of psycholinguistic model and schema theory model, the methodology covers information-gap activity, opinion-gap activity and reasoning-gap activity which can be run in the class. The task-approaches make the interaction between teacher and students, students and students more active and meaningful. Skills of reading to solve communicative problems are always treated consciously. This approach may hopefully result in some improvement on the teaching of English reading.
Advert saliency distracts children's visual attention during task-oriented internet use
Directory of Open Access Journals (Sweden)
Nils eHolmberg
2014-02-01
Full Text Available The general research question of the present study was to assess the impact of visually salient online adverts on children's task-oriented internet use. In order to answer this question, an experimental study was constructed in which 9-year-old and 12-year-old Swedish children were asked to solve a number of tasks while interacting with a mockup website. In each trial, web adverts in several saliency conditions were presented. By both measuring children's task accuracy, as well as the visual processing involved in solving these tasks, this study allows us to infer how two types of visual saliency affect children's attentional behavior, and whether such behavioral effects also impacts their task performance. Analyses show that low-level visual features and task relevance in online adverts have different effects on performance measures and process measures respectively. Whereas task performance is stable with regard to several advert saliency conditions, a marked effect is seen on children's gaze behavior. On the other hand, task performance is shown to be more sensitive to individual differences such as age, gender and level of gaze control. The results provide evidence about cognitive and behavioral distraction effects in children's task-oriented internet use caused by visual saliency in online adverts. The experiment suggests that children to some extent are able to compensate for behavioral effects caused by distracting visual stimuli when solving prospective memory tasks. Suggestions are given for further research into the interdiciplinary area between media research and cognitive science.
Nightingale, James; Wang, Qi; Grecos, Christos
2015-02-01
In recent years video traffic has become the dominant application on the Internet with global year-on-year increases in video-oriented consumer services. Driven by improved bandwidth in both mobile and fixed networks, steadily reducing hardware costs and the development of new technologies, many existing and new classes of commercial and industrial video applications are now being upgraded or emerging. Some of the use cases for these applications include areas such as public and private security monitoring for loss prevention or intruder detection, industrial process monitoring and critical infrastructure monitoring. The use of video is becoming commonplace in defence, security, commercial, industrial, educational and health contexts. Towards optimal performances, the design or optimisation in each of these applications should be context aware and task oriented with the characteristics of the video stream (frame rate, spatial resolution, bandwidth etc.) chosen to match the use case requirements. For example, in the security domain, a task-oriented consideration may be that higher resolution video would be required to identify an intruder than to simply detect his presence. Whilst in the same case, contextual factors such as the requirement to transmit over a resource-limited wireless link, may impose constraints on the selection of optimum task-oriented parameters. This paper presents a novel, conceptually simple and easily implemented method of assessing video quality relative to its suitability for a particular task and dynamically adapting videos streams during transmission to ensure that the task can be successfully completed. Firstly we defined two principle classes of tasks: recognition tasks and event detection tasks. These task classes are further subdivided into a set of task-related profiles, each of which is associated with a set of taskoriented attributes (minimum spatial resolution, minimum frame rate etc.). For example, in the detection class
Hsieh, Yu-Wei; Wu, Ching-Yi; Wang, Wei-En; Lin, Keh-Chung; Chang, Ku-Chou; Chen, Chih-Chi; Liu, Chien-Ting
2017-02-01
To investigate the treatment effects of bilateral robotic priming combined with the task-oriented approach on motor impairment, disability, daily function, and quality of life in patients with subacute stroke. A randomized controlled trial. Occupational therapy clinics in medical centers. Thirty-one subacute stroke patients were recruited. Participants were randomly assigned to receive bilateral priming combined with the task-oriented approach (i.e., primed group) or to the task-oriented approach alone (i.e., unprimed group) for 90 minutes/day, 5 days/week for 4 weeks. The primed group began with the bilateral priming technique by using a bimanual robot-aided device. Motor impairments were assessed by the Fugal-Meyer Assessment, grip strength, and the Box and Block Test. Disability and daily function were measured by the modified Rankin Scale, the Functional Independence Measure, and actigraphy. Quality of life was examined by the Stroke Impact Scale. The primed and unprimed groups improved significantly on most outcomes over time. The primed group demonstrated significantly better improvement on the Stroke Impact Scale strength subscale ( p = 0.012) and a trend for greater improvement on the modified Rankin Scale ( p = 0.065) than the unprimed group. Bilateral priming combined with the task-oriented approach elicited more improvements in self-reported strength and disability degrees than the task-oriented approach by itself. Further large-scale research with at least 31 participants in each intervention group is suggested to confirm the study findings.
Scalable shared-memory multiprocessing
Lenoski, Daniel E
1995-01-01
Dr. Lenoski and Dr. Weber have experience with leading-edge research and practical issues involved in implementing large-scale parallel systems. They were key contributors to the architecture and design of the DASH multiprocessor. Currently, they are involved with commercializing scalable shared-memory technology.
International Nuclear Information System (INIS)
Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; VanBerg, R.
1989-12-01
A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequences, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of GigaBytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab
Crockett, Thomas W.
1995-01-01
This article provides a broad introduction to the subject of parallel rendering, encompassing both hardware and software systems. The focus is on the underlying concepts and the issues which arise in the design of parallel rendering algorithms and systems. We examine the different types of parallelism and how they can be applied in rendering applications. Concepts from parallel computing, such as data decomposition, task granularity, scalability, and load balancing, are considered in relation to the rendering problem. We also explore concepts from computer graphics, such as coherence and projection, which have a significant impact on the structure of parallel rendering algorithms. Our survey covers a number of practical considerations as well, including the choice of architectural platform, communication and memory requirements, and the problem of image assembly and display. We illustrate the discussion with numerous examples from the parallel rendering literature, representing most of the principal rendering methods currently used in computer graphics.
Wang, Hui-Yi; Long, I-Man; Liu, Mei-Fang
2012-01-01
Individuals with Down syndrome (DS) have been characterized by greater postural sway in quiet stance and insufficient motor ability. However, there is a lack of studies to explore the properties of dynamic postural sway, especially under conditions of task-oriented movement. The purpose of this study was to investigate the relationships between…
Molier, B.I.; Prange, Grada Berendina; Krabben, T.; Stienen, Arno; van der Kooij, Herman; Buurke, Jaap; Jannink, M.J.A.; Hermens, Hermanus J.
2011-01-01
Feedback is an important element in motor learning during rehabilitation therapy following stroke. The objective of this pilot study was to better understand the effect of position feedback during task-oriented reach training of the upper limb in people with chronic stroke. Five subjects
Son, Bo-Young; Bang, Yo-Soon; Hwang, Min-Ji; Oh, Eun-Ju
2017-08-01
[Purpose] This study investigates the effects of task-oriented activities on hand function, cognitive function, and self-expression of the elderly with dementia, and then identify the influencing factors on self-expression in sub-factors of dependent variables. [Subjects and Methods] Forty elderly persons were divided into two groups: intervention group (n=20) and control group (n=20). The interventions were applied to the subjects 3 times a week, 50 minutes per each time, for a total of five weeks. We measured the jamar hand dynamometer test for grip strength, the jamar hydraulic pinch gauge test for prehension test, nine-hole pegboard test for coordination test, and Loewenstein Occupational Therapy Cognitive Assessment-Geriatric Population for cognitive function, and self-expression rating scale for self-expression test. [Results] The task-oriented activities promoted hand function, cognitive function (visual perception, spatial perception, visuomotor organization, attention & concentration) and self-expression of the elderly with early dementia, and the factors influencing the self-expression were cognitive function (visual perception) and hand function (coordination). The study showed that the task-oriented program enabled self-expression by improving hand function and cognitive function. [Conclusion] This study suggested that there should be provided the task-oriented program for prevention and treatment of the elderly with early dementia in the clinical settings and it was considered that results have a value as basic data that can be verified relationship of hand function, cognitive function, and self-expression.
Temporal motifs reveal collaboration patterns in online task-oriented networks
Xuan, Qi; Fang, Huiting; Fu, Chenbo; Filkov, Vladimir
2015-05-01
Real networks feature layers of interactions and complexity. In them, different types of nodes can interact with each other via a variety of events. Examples of this complexity are task-oriented social networks (TOSNs), where teams of people share tasks towards creating a quality artifact, such as academic research papers or software development in commercial or open source environments. Accomplishing those tasks involves both work, e.g., writing the papers or code, and communication, to discuss and coordinate. Taking into account the different types of activities and how they alternate over time can result in much more precise understanding of the TOSNs behaviors and outcomes. That calls for modeling techniques that can accommodate both node and link heterogeneity as well as temporal change. In this paper, we report on methodology for finding temporal motifs in TOSNs, limited to a system of two people and an artifact. We apply the methods to publicly available data of TOSNs from 31 Open Source Software projects. We find that these temporal motifs are enriched in the observed data. When applied to software development outcome, temporal motifs reveal a distinct dependency between collaboration and communication in the code writing process. Moreover, we show that models based on temporal motifs can be used to more precisely relate both individual developer centrality and team cohesion to programmer productivity than models based on aggregated TOSNs.
A novel task-oriented optimal design for P300-based brain-computer interfaces.
Zhou, Zongtan; Yin, Erwei; Liu, Yang; Jiang, Jun; Hu, Dewen
2014-10-01
Objective. The number of items of a P300-based brain-computer interface (BCI) should be adjustable in accordance with the requirements of the specific tasks. To address this issue, we propose a novel task-oriented optimal approach aimed at increasing the performance of general P300 BCIs with different numbers of items. Approach. First, we proposed a stimulus presentation with variable dimensions (VD) paradigm as a generalization of the conventional single-character (SC) and row-column (RC) stimulus paradigms. Furthermore, an embedding design approach was employed for any given number of items. Finally, based on the score-P model of each subject, the VD flash pattern was selected by a linear interpolation approach for a certain task. Main results. The results indicate that the optimal BCI design consistently outperforms the conventional approaches, i.e., the SC and RC paradigms. Specifically, there is significant improvement in the practical information transfer rate for a large number of items. Significance. The results suggest that the proposed optimal approach would provide useful guidance in the practical design of general P300-based BCIs.
Energy Technology Data Exchange (ETDEWEB)
Robert W. Numrich
2008-04-22
The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a programto provide access to the co-array model through access to the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to
Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun
2015-12-01
Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition and action recognition because of its best-in-class recognition accuracy compared to hand-crafted algorithm and shallow learning based algorithms. Long learning time caused by its complex structure, however, limits its usage only in high-cost servers or many-core GPU platforms so far. On the other hand, the demand on customized pattern recognition within personal devices will grow gradually as more deep learning applications will be developed. This paper presents a SoC implementation to enable deep learning applications to run with low cost platforms such as mobile or portable devices. Different from conventional works which have adopted massively-parallel architecture, this work adopts task-flexible architecture and exploits multiple parallelism to cover complex functions of convolutional deep belief network which is one of popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable system. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.
Scalable algorithms for contact problems
Dostál, Zdeněk; Sadowská, Marie; Vondrák, Vít
2016-01-01
This book presents a comprehensive and self-contained treatment of the authors’ newly developed scalable algorithms for the solutions of multibody contact problems of linear elasticity. The brand new feature of these algorithms is theoretically supported numerical scalability and parallel scalability demonstrated on problems discretized by billions of degrees of freedom. The theory supports solving multibody frictionless contact problems, contact problems with possibly orthotropic Tresca’s friction, and transient contact problems. It covers BEM discretization, jumping coefficients, floating bodies, mortar non-penetration conditions, etc. The exposition is divided into four parts, the first of which reviews appropriate facets of linear algebra, optimization, and analysis. The most important algorithms and optimality results are presented in the third part of the volume. The presentation is complete, including continuous formulation, discretization, decomposition, optimality results, and numerical experimen...
Srikesavan, Cynthia Swarnalatha; Shay, Barbara; Szturm, Tony
2016-09-13
To examine the feasibility of a clinical trial on a novel, home-based task-oriented training with conventional hand exercises in people with rheumatoid arthritis or hand osteoarthritis. To explore the experiences of participants who completed their respective home exercise programmes. Thirty volunteer participants aged between 30 and 60 years and diagnosed with rheumatoid arthritis or hand osteoarthritis were proposed for a single-center, assessor-blinded, randomized controlled trial ( ClinicalTrials.gov : NCT01635582). Participants received task-oriented training with interactive computer games and objects of daily life or finger mobility and strengthening exercises. Both programmes were home based and were done four sessions per week with 20 minutes each session for 6 weeks. Major feasibility outcomes were number of volunteers screened, randomized, and retained; completion of blinded assessments, exercise training, and home exercise sessions; equipment and data management; and clinical outcomes of hand function. Reaching the recruitment target in 18 months and achieving exercise compliance >80% were set as success criteria. Concurrent with the trial, focus group interviews explored experiences of those participants who completed their respective programmes. After trial initiation, revisions in inclusion criteria were required to promote recruitment. A total of 17 participants were randomized and 15 were retained. Completion of assessments, exercise training, and home exercise sessions; equipment and data collection and management demonstrated excellent feasibility. Both groups improved in hand function outcomes and exercise compliance was above 85%. Participants perceived both programmes as appropriate and acceptable. Participants who completed task-oriented training also agreed that playing different computer games was enjoyable, engaging, and motivating. Findings demonstrate initial evidence on recruitment, feasibility of trial procedures, and acceptability of
2017-05-06
In an effort to address performance gaps we devised a teaching paradigm, called Task-Oriented Role Ass ignment’, in which we have delegated a...task delegation , thereby improving NRP performance. Health care professionals taking the NRP course were randomized to either the control group, which...such as leadership (mean = 4· control, s study; p = 0.05). However, both groups scored similarly in overall NRP task performance with mean scores of
Scalable Domain Decomposed Monte Carlo Particle Transport
Energy Technology Data Exchange (ETDEWEB)
O' Brien, Matthew Joseph [Univ. of California, Davis, CA (United States)
2013-12-05
In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation.
Scalable Nonlinear Compact Schemes
Energy Technology Data Exchange (ETDEWEB)
Ghosh, Debojyoti [Argonne National Lab. (ANL), Argonne, IL (United States); Constantinescu, Emil M. [Univ. of Chicago, IL (United States); Brown, Jed [Univ. of Colorado, Boulder, CO (United States)
2014-04-01
In this work, we focus on compact schemes resulting in tridiagonal systems of equations, specifically the fifth-order CRWENO scheme. We propose a scalable implementation of the nonlinear compact schemes by implementing a parallel tridiagonal solver based on the partitioning/substructuring approach. We use an iterative solver for the reduced system of equations; however, we solve this system to machine zero accuracy to ensure that no parallelization errors are introduced. It is possible to achieve machine-zero convergence with few iterations because of the diagonal dominance of the system. The number of iterations is specified a priori instead of a norm-based exit criterion, and collective communications are avoided. The overall algorithm thus involves only point-to-point communication between neighboring processors. Our implementation of the tridiagonal solver differs from and avoids the drawbacks of past efforts in the following ways: it introduces no parallelization-related approximations (multiprocessor solutions are exactly identical to uniprocessor ones), it involves minimal communication, the mathematical complexity is similar to that of the Thomas algorithm on a single processor, and it does not require any communication and computation scheduling.
Joint Experimentation on Scalable Parallel Processors (JESPP)
National Research Council Canada - National Science Library
Davis, Dan M; Lucas, Robert F; Yao, Ka-Thia; Wagenbrath, Gene
2006-01-01
...) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity, terrain resolution, infrastructure features, environmental realism, and analytical potential...
Enhancing Scalability of Sparse Direct Methods
International Nuclear Information System (INIS)
Li, Xiaoye S.; Demmel, James; Grigori, Laura; Gu, Ming; Xia, Jianlin; Jardin, Steve; Sovinec, Carl; Lee, Lie-Quan
2007-01-01
TOPS is providing high-performance, scalable sparse direct solvers, which have had significant impacts on the SciDAC applications, including fusion simulation (CEMM), accelerator modeling (COMPASS), as well as many other mission-critical applications in DOE and elsewhere. Our recent developments have been focusing on new techniques to overcome scalability bottleneck of direct methods, in both time and memory. These include parallelizing symbolic analysis phase and developing linear-complexity sparse factorization methods. The new techniques will make sparse direct methods more widely usable in large 3D simulations on highly-parallel petascale computers
Maghami, Mahsa; Sukthankar, Gita
In this paper, we introduce an agent-based simulation for investigating the impact of social factors on the formation and evolution of task-oriented groups. Task-oriented groups are created explicitly to perform a task, and all members derive benefits from task completion. However, even in cases when all group members act in a way that is locally optimal for task completion, social forces that have mild effects on choice of associates can have a measurable impact on task completion performance. In this paper, we show how our simulation can be used to model the impact of stereotypes on group formation. In our simulation, stereotypes are based on observable features, learned from prior experience, and only affect an agent's link formation preferences. Even without assuming stereotypes affect the agents' willingness or ability to complete tasks, the long-term modifications that stereotypes have on the agents' social network impair the agents' ability to form groups with sufficient diversity of skills, as compared to agents who form links randomly. An interesting finding is that this effect holds even in cases where stereotype preference and skill existence are completely uncorrelated.
Directory of Open Access Journals (Sweden)
Lorena Ruiz-González
2015-12-01
Full Text Available Abstract The aim of this study was to test the predictive power of dispositional orientations, general self-efficacy and self-determined motivation on fun and boredom in physical education classes, with a sample of 459 adolescents between 13 and 18 with a mean age of 15 years (SD = 0.88. The adolescents responded to four Likert scales: Perceptions of Success Questionnaire, General Self-Efficacy Scale, Sport Motivation Scale and Intrinsic Satisfaction Questionnaire in Sport. The results showed the structural regression model showed that task orientation and general self-efficacy positively predicted self-determined motivation and this in turn positively predicted more fun and less boredom in physical education classes. Consequently, the promotion of an educational task-oriented environment where learners perceive their progress and make them feel more competent, will allow them to overcome the intrinsically motivated tasks, and therefore they will have more fun. Pedagogical implications for less boredom and more fun in physical education classes are discussed.
Ghous, Misbah; Malik, Arshad Nawaz; Amjad, Mian Imran; Kanwal, Maria
2017-07-01
Stroke is one of most disabling condition which directly affects quality of life. The objective of this study was to compare the effect of activity repetition training with salat (prayer) versus task oriented training on functional outcomes of stroke. The study design was randomized control trial and 32 patients were randomly assigned into two groups'. The stroke including infarction or haemorrhagic, age bracket 30-70 years was included. The demographics were recorded and standardized assessment tool included Berg Balance Scale (BBS), Motor assessment scale (MAS) and Time Up and Go Test (TUG). The measurements were obtained at baseline, after four and six weeks. The mean age of the patients was 54.44±10.59 years with 16 (59%) male and 11(41%) female patients. Activity Repetition Training group showed significant improvement (peffective in enhancing the functional status as compare to task oriented training group. The repetition with motivation and concentration is the key in re-learning process of neural plasticity.
Scalable fast multipole accelerated vortex methods
Hu, Qi; Gumerov, Nail A.; Yokota, Rio; Barba, Lorena A.; Duraiswami, Ramani
2014-01-01
-node communication and load balance efficiently, with only a small parallel construction overhead. This algorithm can scale to large-sized clusters showing both strong and weak scalability. Careful error and timing trade-off analysis are also performed for the cutoff
Scalable Frequent Subgraph Mining
Abdelhamid, Ehab
2017-06-19
A graph is a data structure that contains a set of nodes and a set of edges connecting these nodes. Nodes represent objects while edges model relationships among these objects. Graphs are used in various domains due to their ability to model complex relations among several objects. Given an input graph, the Frequent Subgraph Mining (FSM) task finds all subgraphs with frequencies exceeding a given threshold. FSM is crucial for graph analysis, and it is an essential building block in a variety of applications, such as graph clustering and indexing. FSM is computationally expensive, and its existing solutions are extremely slow. Consequently, these solutions are incapable of mining modern large graphs. This slowness is caused by the underlying approaches of these solutions which require finding and storing an excessive amount of subgraph matches. This dissertation proposes a scalable solution for FSM that avoids the limitations of previous work. This solution is composed of four components. The first component is a single-threaded technique which, for each candidate subgraph, needs to find only a minimal number of matches. The second component is a scalable parallel FSM technique that utilizes a novel two-phase approach. The first phase quickly builds an approximate search space, which is then used by the second phase to optimize and balance the workload of the FSM task. The third component focuses on accelerating frequency evaluation, which is a critical step in FSM. To do so, a machine learning model is employed to predict the type of each graph node, and accordingly, an optimized method is selected to evaluate that node. The fourth component focuses on mining dynamic graphs, such as social networks. To this end, an incremental index is maintained during the dynamic updates. Only this index is processed and updated for the majority of graph updates. Consequently, search space is significantly pruned and efficiency is improved. The empirical evaluation shows that the
Au, Mei K; Chan, Wai M; Lee, Lin; Chen, Tracy Mk; Chau, Rosanna Mw; Pang, Marco Yc
2014-10-01
To compare the effectiveness of a core stability program with a task-oriented motor training program in improving motor proficiency in children with developmental coordination disorder (DCD). Randomized controlled pilot trial. Outpatient unit in a hospital. Twenty-two children diagnosed with DCD aged 6-9 years were randomly allocated to the core stability program or the task-oriented motor program. Both groups underwent their respective face-to-face training session once per week for eight consecutive weeks. They were also instructed to carry out home exercises on a daily basis during the intervention period. Short Form of the Bruininks-Oseretsky Test of Motor Proficiency (Second Edition) and Sensory Organization Test at pre- and post-intervention. Intention-to-treat analysis revealed no significant between-group difference in the change of motor proficiency standard score (P=0.717), and composite equilibrium score derived from the Sensory Organization Test (P=0.100). Further analysis showed significant improvement in motor proficiency in both the core stability (mean change (SD)=6.3(5.4); p=0.008) and task-oriented training groups (mean change(SD)=5.1(4.0); P=0.007). The composite equilibrium score was significantly increased in the task-oriented training group (mean change (SD)=6.0(5.5); P=0.009), but not in the core stability group (mean change(SD) =0.0(9.6); P=0.812). In the task-oriented training group, compliance with the home program was positively correlated with change in motor proficiency (ρ=0.680, P=0.030) and composite equilibrium score (ρ=0.638, P=0.047). The core stability exercise program is as effective as task-oriented training in improving motor proficiency among children with DCD. © The Author(s) 2014.
Stanford Univ., CA. Dept. of Communication.
A 21 minute sound color film was produced in an attempt to help teachers and educational specialists see people like themselves helping youngsters become more task oriented by the use of operant conditioning and modeling procedures formulated by Skinner (1953) and Bandura (1964). The unstructured and unrehearsed film consists of edited film…
Tabachnick, Sharon E.; Miller, Raymond B.; Relyea, George E.
2008-01-01
The authors performed path analysis, followed by a bootstrap procedure, to test the predictions of a model explaining the relationships among students' distal future goals (both extrinsic and intrinsic), their adoption of a middle-range subgoal, their perceptions of task instrumentality, and their proximal task-oriented self-regulation strategies.…
Kwon, Hae-Yeon; Ahn, So-Yoon
2016-10-01
[Purpose] This study investigates how a task-oriented training and high-variability practice program can affect the gross motor performance and activities of daily living for children with spastic diplegia and provides an effective and reliable clinical database for future improvement of motor performances skills. [Subjects and Methods] This study randomly assigned seven children with spastic diplegia to each intervention group including that of a control group, task-oriented training group, and a high-variability practice group. The control group only received neurodevelopmental treatment for 40 minutes, while the other two intervention groups additionally implemented a task-oriented training and high-variability practice program for 8 weeks (twice a week, 60 min per session). To compare intra and inter-relationships of the three intervention groups, this study measured gross motor performance measure (GMPM) and functional independence measure for children (WeeFIM) before and after 8 weeks of training. [Results] There were statistically significant differences in the amount of change before and after the training among the three intervention groups for the gross motor performance measure and functional independence measure. [Conclusion] Applying high-variability practice in a task-oriented training course may be considered an efficient intervention method to improve motor performance skills that can tune to movement necessary for daily livelihood through motor experience and learning of new skills as well as change of tasks learned in a complex environment or similar situations to high-variability practice.
Directory of Open Access Journals (Sweden)
Roelse Hanneke
2009-08-01
Full Text Available Abstract Background Most patients who suffer a stroke experience reduced walking competency and health-related quality of life (HRQoL. A key factor in effective stroke rehabilitation is intensive, task-specific training. Recent studies suggest that intensive, patient-tailored training can be organized as a circuit with a series of task-oriented workstations. Primary aim of the FIT-Stroke trial is to evaluate the effects and cost-effectiveness of a structured, progressive task-oriented circuit class training (CCT programme, compared to usual physiotherapeutic care during outpatient rehabilitation in a rehabilitation centre. The task-oriented CCT will be applied in groups of 4 to 6 patients. Outcome will be defined in terms of gait and gait-related ADLs after stroke. The trial will also investigate the generalizability of treatment effects of task-oriented CCT in terms of perceived fatigue, anxiety, depression and perceived HRQoL. Methods/design The multicentre single-blinded randomized trial will include 220 stroke patients discharged to the community from inpatient rehabilitation, who are able to communicate and walk at least 10 m without physical, hands-on assistance. After discharge from inpatient rehabilitation, patients in the experimental group will receive task-oriented CCT two times a week for 12 weeks at the physiotherapy department of the rehabilitation centre. Control group patients will receive usual individual, face-to-face, physiotherapy. Costs will be evaluated by having each patient keep a cost diary for the first 24 weeks after randomisation. Primary outcomes are the mobility part of the Stroke Impact Scale (SIS-3.0 and the EuroQol. Secondary outcomes are the other domains of SIS-3.0, lower limb muscle strength, walking endurance, gait speed, balance, confidence not to fall, instrumental ADL, fatigue, anxiety, depression and HRQoL. Discussion Based on assumptions about the effect of intensity of practice and specificity of
A scalable distributed RRT for motion planning
Jacobs, Sam Ade
2013-05-01
Rapidly-exploring Random Tree (RRT), like other sampling-based motion planning methods, has been very successful in solving motion planning problems. Even so, sampling-based planners cannot solve all problems of interest efficiently, so attention is increasingly turning to parallelizing them. However, one challenge in parallelizing RRT is the global computation and communication overhead of nearest neighbor search, a key operation in RRTs. This is a critical issue as it limits the scalability of previous algorithms. We present two parallel algorithms to address this problem. The first algorithm extends existing work by introducing a parameter that adjusts how much local computation is done before a global update. The second algorithm radially subdivides the configuration space into regions, constructs a portion of the tree in each region in parallel, and connects the subtrees,i removing cycles if they exist. By subdividing the space, we increase computation locality enabling a scalable result. We show that our approaches are scalable. We present results demonstrating almost linear scaling to hundreds of processors on a Linux cluster and a Cray XE6 machine. © 2013 IEEE.
A scalable distributed RRT for motion planning
Jacobs, Sam Ade; Stradford, Nicholas; Rodriguez, Cesar; Thomas, Shawna; Amato, Nancy M.
2013-01-01
Rapidly-exploring Random Tree (RRT), like other sampling-based motion planning methods, has been very successful in solving motion planning problems. Even so, sampling-based planners cannot solve all problems of interest efficiently, so attention is increasingly turning to parallelizing them. However, one challenge in parallelizing RRT is the global computation and communication overhead of nearest neighbor search, a key operation in RRTs. This is a critical issue as it limits the scalability of previous algorithms. We present two parallel algorithms to address this problem. The first algorithm extends existing work by introducing a parameter that adjusts how much local computation is done before a global update. The second algorithm radially subdivides the configuration space into regions, constructs a portion of the tree in each region in parallel, and connects the subtrees,i removing cycles if they exist. By subdividing the space, we increase computation locality enabling a scalable result. We show that our approaches are scalable. We present results demonstrating almost linear scaling to hundreds of processors on a Linux cluster and a Cray XE6 machine. © 2013 IEEE.
Design for scalability in 3D computer graphics architectures
DEFF Research Database (Denmark)
Holten-Lund, Hans Erik
2002-01-01
This thesis describes useful methods and techniques for designing scalable hybrid parallel rendering architectures for 3D computer graphics. Various techniques for utilizing parallelism in a pipelines system are analyzed. During the Ph.D study a prototype 3D graphics architecture named Hybris has...
Scalable Atomistic Simulation Algorithms for Materials Research
Directory of Open Access Journals (Sweden)
Aiichiro Nakano
2002-01-01
Full Text Available A suite of scalable atomistic simulation programs has been developed for materials research based on space-time multiresolution algorithms. Design and analysis of parallel algorithms are presented for molecular dynamics (MD simulations and quantum-mechanical (QM calculations based on the density functional theory. Performance tests have been carried out on 1,088-processor Cray T3E and 1,280-processor IBM SP3 computers. The linear-scaling algorithms have enabled 6.44-billion-atom MD and 111,000-atom QM calculations on 1,024 SP3 processors with parallel efficiency well over 90%. production-quality programs also feature wavelet-based computational-space decomposition for adaptive load balancing, spacefilling-curve-based adaptive data compression with user-defined error bound for scalable I/O, and octree-based fast visibility culling for immersive and interactive visualization of massive simulation data.
Scalable and balanced dynamic hybrid data assimilation
Kauranne, Tuomo; Amour, Idrissa; Gunia, Martin; Kallio, Kari; Lepistö, Ahti; Koponen, Sampsa
2017-04-01
Scalability of complex weather forecasting suites is dependent on the technical tools available for implementing highly parallel computational kernels, but to an equally large extent also on the dependence patterns between various components of the suite, such as observation processing, data assimilation and the forecast model. Scalability is a particular challenge for 4D variational assimilation methods that necessarily couple the forecast model into the assimilation process and subject this combination to an inherently serial quasi-Newton minimization process. Ensemble based assimilation methods are naturally more parallel, but large models force ensemble sizes to be small and that results in poor assimilation accuracy, somewhat akin to shooting with a shotgun in a million-dimensional space. The Variational Ensemble Kalman Filter (VEnKF) is an ensemble method that can attain the accuracy of 4D variational data assimilation with a small ensemble size. It achieves this by processing a Gaussian approximation of the current error covariance distribution, instead of a set of ensemble members, analogously to the Extended Kalman Filter EKF. Ensemble members are re-sampled every time a new set of observations is processed from a new approximation of that Gaussian distribution which makes VEnKF a dynamic assimilation method. After this a smoothing step is applied that turns VEnKF into a dynamic Variational Ensemble Kalman Smoother VEnKS. In this smoothing step, the same process is iterated with frequent re-sampling of the ensemble but now using past iterations as surrogate observations until the end result is a smooth and balanced model trajectory. In principle, VEnKF could suffer from similar scalability issues as 4D-Var. However, this can be avoided by isolating the forecast model completely from the minimization process by implementing the latter as a wrapper code whose only link to the model is calling for many parallel and totally independent model runs, all of them
Cha, Hyun-Gyu; Oh, Duck-Won
2016-03-01
This study aimed to explore the effects of mirror therapy integrated with task-oriented exercise on balance function in poststroke hemiparesis. Twenty patients with poststroke hemiparesis were assigned randomly to an experimental group (EG) and a control group (CG), with 10 individuals each. Participants of the EG and CG received a task-oriented exercise program with a focus on the strengthening of the lower limb and the practice of balance-related functional tasks. An additional option for the EG was front and side wall mirrors to provide visual feedback for their own movements while performing the exercise. The program was performed for 30 min, twice a day, five times per week for 4 weeks. Outcome measures included the Berg balance scale, the timed up-and-go test, and quantitative data (balance index and dynamic limits of stability). In the EG and CG, all variables showed significant differences between pretest and post-test (Phemiparesis.
Scalable conditional induction variables (CIV) analysis
DEFF Research Database (Denmark)
Oancea, Cosmin Eugen; Rauchwerger, Lawrence
2015-01-01
parallelizing compiler and evaluated its impact on five Fortran benchmarks. We have found that that there are many important loops using CIV subscripts and that our analysis can lead to their scalable parallelization. This in turn has led to the parallelization of the benchmark programs they appear in.......Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as filter, or stack operations and pose significant challenges to automatic parallelization. Because...... the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow-sensitive technique that summarizes both such CIV-based and affine subscripts to program level, using the same...
Parallel hierarchical radiosity rendering
Energy Technology Data Exchange (ETDEWEB)
Carter, Michael [Iowa State Univ., Ames, IA (United States)
1993-07-01
In this dissertation, the step-by-step development of a scalable parallel hierarchical radiosity renderer is documented. First, a new look is taken at the traditional radiosity equation, and a new form is presented in which the matrix of linear system coefficients is transformed into a symmetric matrix, thereby simplifying the problem and enabling a new solution technique to be applied. Next, the state-of-the-art hierarchical radiosity methods are examined for their suitability to parallel implementation, and scalability. Significant enhancements are also discovered which both improve their theoretical foundations and improve the images they generate. The resultant hierarchical radiosity algorithm is then examined for sources of parallelism, and for an architectural mapping. Several architectural mappings are discussed. A few key algorithmic changes are suggested during the process of making the algorithm parallel. Next, the performance, efficiency, and scalability of the algorithm are analyzed. The dissertation closes with a discussion of several ideas which have the potential to further enhance the hierarchical radiosity method, or provide an entirely new forum for the application of hierarchical methods.
Slagell, Adam J; Bonilla, Rafael
2004-01-01
This report surveys different PKI technologies such as PKIX and SPKI and the issues of PKI that affect scalability. Much focus is spent on certificate revocation methodologies and status verification systems such as CRLs, Delta-CRLs, CRS, Certificate Revocation Trees, Windowed Certificate Revocation, OCSP, SCVP and DVCS.
Engineering-Based Thermal CFD Simulations on Massive Parallel Systems
Frisch, Jé rô me; Mundani, Ralf-Peter; Rank, Ernst; van Treeck, Christoph
2015-01-01
The development of parallel Computational Fluid Dynamics (CFD) codes is a challenging task that entails efficient parallelization concepts and strategies in order to achieve good scalability values when running those codes on modern supercomputers
Scalable Performance Measurement and Analysis
Energy Technology Data Exchange (ETDEWEB)
Gamblin, Todd [Univ. of North Carolina, Chapel Hill, NC (United States)
2009-01-01
Concurrency levels in large-scale, distributed-memory supercomputers are rising exponentially. Modern machines may contain 100,000 or more microprocessor cores, and the largest of these, IBM's Blue Gene/L, contains over 200,000 cores. Future systems are expected to support millions of concurrent tasks. In this dissertation, we focus on efficient techniques for measuring and analyzing the performance of applications running on very large parallel machines. Tuning the performance of large-scale applications can be a subtle and time-consuming task because application developers must measure and interpret data from many independent processes. While the volume of the raw data scales linearly with the number of tasks in the running system, the number of tasks is growing exponentially, and data for even small systems quickly becomes unmanageable. Transporting performance data from so many processes over a network can perturb application performance and make measurements inaccurate, and storing such data would require a prohibitive amount of space. Moreover, even if it were stored, analyzing the data would be extremely time-consuming. In this dissertation, we present novel methods for reducing performance data volume. The first draws on multi-scale wavelet techniques from signal processing to compress systemwide, time-varying load-balance data. The second uses statistical sampling to select a small subset of running processes to generate low-volume traces. A third approach combines sampling and wavelet compression to stratify performance data adaptively at run-time and to reduce further the cost of sampled tracing. We have integrated these approaches into Libra, a toolset for scalable load-balance analysis. We present Libra and show how it can be used to analyze data from large scientific applications scalably.
Scalable Algorithms for Adaptive Statistical Designs
Directory of Open Access Journals (Sweden)
Robert Oehmke
2000-01-01
Full Text Available We present a scalable, high-performance solution to multidimensional recurrences that arise in adaptive statistical designs. Adaptive designs are an important class of learning algorithms for a stochastic environment, and we focus on the problem of optimally assigning patients to treatments in clinical trials. While adaptive designs have significant ethical and cost advantages, they are rarely utilized because of the complexity of optimizing and analyzing them. Computational challenges include massive memory requirements, few calculations per memory access, and multiply-nested loops with dynamic indices. We analyze the effects of various parallelization options, and while standard approaches do not work well, with effort an efficient, highly scalable program can be developed. This allows us to solve problems thousands of times more complex than those solved previously, which helps make adaptive designs practical. Further, our work applies to many other problems involving neighbor recurrences, such as generalized string matching.
Scalable Resolution Display Walls
Leigh, Jason; Johnson, Andrew; Renambot, Luc; Peterka, Tom; Jeong, Byungil; Sandin, Daniel J.; Talandis, Jonas; Jagodic, Ratko; Nam, Sungwon; Hur, Hyejung; Sun, Yiwen
2013-01-01
This article will describe the progress since 2000 on research and development in 2-D and 3-D scalable resolution display walls that are built from tiling individual lower resolution flat panel displays. The article will describe approaches and trends in display hardware construction, middleware architecture, and user-interaction design. The article will also highlight examples of use cases and the benefits the technology has brought to their respective disciplines. © 1963-2012 IEEE.
SPINning parallel systems software
International Nuclear Information System (INIS)
Matlin, O.S.; Lusk, E.; McCune, W.
2002-01-01
We describe our experiences in using Spin to verify parts of the Multi Purpose Daemon (MPD) parallel process management system. MPD is a distributed collection of processes connected by Unix network sockets. MPD is dynamic processes and connections among them are created and destroyed as MPD is initialized, runs user processes, recovers from faults, and terminates. This dynamic nature is easily expressible in the Spin/Promela framework but poses performance and scalability challenges. We present here the results of expressing some of the parallel algorithms of MPD and executing both simulation and verification runs with Spin
Parallel auto-correlative statistics with VTK.
Energy Technology Data Exchange (ETDEWEB)
Pebay, Philippe Pierre; Bennett, Janine Camille
2013-08-01
This report summarizes existing statistical engines in VTK and presents both the serial and parallel auto-correlative statistics engines. It is a sequel to [PT08, BPRT09b, PT09, BPT09, PT10] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, contingency, k-means, and order statistics engines. The ease of use of the new parallel auto-correlative statistics engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the autocorrelative statistics engine.
Scalable Simulation of Electromagnetic Hybrid Codes
International Nuclear Information System (INIS)
Perumalla, Kalyan S.; Fujimoto, Richard; Karimabadi, Dr. Homa
2006-01-01
New discrete-event formulations of physics simulation models are emerging that can outperform models based on traditional time-stepped techniques. Detailed simulation of the Earth's magnetosphere, for example, requires execution of sub-models that are at widely differing timescales. In contrast to time-stepped simulation which requires tightly coupled updates to entire system state at regular time intervals, the new discrete event simulation (DES) approaches help evolve the states of sub-models on relatively independent timescales. However, parallel execution of DES-based models raises challenges with respect to their scalability and performance. One of the key challenges is to improve the computation granularity to offset synchronization and communication overheads within and across processors. Our previous work was limited in scalability and runtime performance due to the parallelization challenges. Here we report on optimizations we performed on DES-based plasma simulation models to improve parallel performance. The net result is the capability to simulate hybrid particle-in-cell (PIC) models with over 2 billion ion particles using 512 processors on supercomputing platforms
Massively Parallel Finite Element Programming
Heister, Timo
2010-01-01
Today\\'s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Massively Parallel Finite Element Programming
Heister, Timo; Kronbichler, Martin; Bangerth, Wolfgang
2010-01-01
Today's large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting factor when solving on more than a few hundreds of cores. We describe routines for distributed storage of all major components coupled with efficient, scalable algorithms. We give an overview of our effort to enable the modern and generic finite element library deal.II to take advantage of the power of large clusters. In particular, we describe the construction of a distributed mesh and develop algorithms to fully parallelize the finite element calculation. Numerical results demonstrate good scalability. © 2010 Springer-Verlag.
Scalable optical quantum computer
Energy Technology Data Exchange (ETDEWEB)
Manykin, E A; Mel' nichenko, E V [Institute for Superconductivity and Solid-State Physics, Russian Research Centre ' Kurchatov Institute' , Moscow (Russian Federation)
2014-12-31
A way of designing a scalable optical quantum computer based on the photon echo effect is proposed. Individual rare earth ions Pr{sup 3+}, regularly located in the lattice of the orthosilicate (Y{sub 2}SiO{sub 5}) crystal, are suggested to be used as optical qubits. Operations with qubits are performed using coherent and incoherent laser pulses. The operation protocol includes both the method of measurement-based quantum computations and the technique of optical computations. Modern hybrid photon echo protocols, which provide a sufficient quantum efficiency when reading recorded states, are considered as most promising for quantum computations and communications. (quantum computer)
Scalable optical quantum computer
International Nuclear Information System (INIS)
Manykin, E A; Mel'nichenko, E V
2014-01-01
A way of designing a scalable optical quantum computer based on the photon echo effect is proposed. Individual rare earth ions Pr 3+ , regularly located in the lattice of the orthosilicate (Y 2 SiO 5 ) crystal, are suggested to be used as optical qubits. Operations with qubits are performed using coherent and incoherent laser pulses. The operation protocol includes both the method of measurement-based quantum computations and the technique of optical computations. Modern hybrid photon echo protocols, which provide a sufficient quantum efficiency when reading recorded states, are considered as most promising for quantum computations and communications. (quantum computer)
Scalable photoreactor for hydrogen production
Takanabe, Kazuhiro; Shinagawa, Tatsuya
2017-01-01
Provided herein are scalable photoreactors that can include a membrane-free water- splitting electrolyzer and systems that can include a plurality of membrane-free water- splitting electrolyzers. Also provided herein are methods of using the scalable photoreactors provided herein.
Scalable photoreactor for hydrogen production
Takanabe, Kazuhiro
2017-04-06
Provided herein are scalable photoreactors that can include a membrane-free water- splitting electrolyzer and systems that can include a plurality of membrane-free water- splitting electrolyzers. Also provided herein are methods of using the scalable photoreactors provided herein.
Highly scalable Ab initio genomic motif identification
Marchand, Benoit; Bajic, Vladimir B.; Kaushik, Dinesh
2011-01-01
We present results of scaling an ab initio motif family identification system, Dragon Motif Finder (DMF), to 65,536 processor cores of IBM Blue Gene/P. DMF seeks groups of mutually similar polynucleotide patterns within a set of genomic sequences and builds various motif families from them. Such information is of relevance to many problems in life sciences. Prior attempts to scale such ab initio motif-finding algorithms achieved limited success. We solve the scalability issues using a combination of mixed-mode MPI-OpenMP parallel programming, master-slave work assignment, multi-level workload distribution, multi-level MPI collectives, and serial optimizations. While the scalability of our algorithm was excellent (94% parallel efficiency on 65,536 cores relative to 256 cores on a modest-size problem), the final speedup with respect to the original serial code exceeded 250,000 when serial optimizations are included. This enabled us to carry out many large-scale ab initio motiffinding simulations in a few hours while the original serial code would have needed decades of execution time. Copyright 2011 ACM.
SuperLU{_}DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems
Energy Technology Data Exchange (ETDEWEB)
Li, Xiaoye S.; Demmel, James W.
2002-03-27
In this paper, we present the main algorithmic features in the software package SuperLU{_}DIST, a distributed-memory sparse direct solver for large sets of linear equations. We give in detail our parallelization strategies, with focus on scalability issues, and demonstrate the parallel performance and scalability on current machines. The solver is based on sparse Gaussian elimination, with an innovative static pivoting strategy proposed earlier by the authors. The main advantage of static pivoting over classical partial pivoting is that it permits a priori determination of data structures and communication pattern for sparse Gaussian elimination, which makes it more scalable on distributed memory machines. Based on this a priori knowledge, we designed highly parallel and scalable algorithms for both LU decomposition and triangular solve and we show that they are suitable for large-scale distributed memory machines.
Distance-two interpolation for parallel algebraic multigrid
International Nuclear Information System (INIS)
Sterck, H de; Falgout, R D; Nolting, J W; Yang, U M
2007-01-01
In this paper we study the use of long distance interpolation methods with the low complexity coarsening algorithm PMIS. AMG performance and scalability is compared for classical as well as long distance interpolation methods on parallel computers. It is shown that the increased interpolation accuracy largely restores the scalability of AMG convergence factors for PMIS-coarsened grids, and in combination with complexity reducing methods, such as interpolation truncation, one obtains a class of parallel AMG methods that enjoy excellent scalability properties on large parallel computers
Scalable conditional induction variables (CIV) analysis
Oancea, Cosmin E.
2015-02-01
Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as Alter, or stack operations and pose significant challenges to automatic parallelization. Because the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow-sensitive technique that summarizes both such CIV-based and affine subscripts to program level, using the same representation. Our technique requires no modifications of our dependence tests, which is agnostic to the original shape of the subscripts, and is more powerful than previously reported dependence tests that rely on the pairwise disambiguation of read-write references. We have implemented the CIV analysis in our parallelizing compiler and evaluated its impact on five Fortran benchmarks. We have found that that there are many important loops using CIV subscripts and that our analysis can lead to their scalable parallelization. This in turn has led to the parallelization of the benchmark programs they appear in.
Scalable Nanomanufacturing—A Review
Directory of Open Access Journals (Sweden)
Khershed Cooper
2017-01-01
Full Text Available This article describes the field of scalable nanomanufacturing, its importance and need, its research activities and achievements. The National Science Foundation is taking a leading role in fostering basic research in scalable nanomanufacturing (SNM. From this effort several novel nanomanufacturing approaches have been proposed, studied and demonstrated, including scalable nanopatterning. This paper will discuss SNM research areas in materials, processes and applications, scale-up methods with project examples, and manufacturing challenges that need to be addressed to move nanotechnology discoveries closer to the marketplace.
Scalable Automated Model Search
2014-05-20
profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on...minimization. The computer journal, 7(4):308–313, 1965. [31] K. Ousterhout, A. Panda , J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker...and I. Stoica. The case for tiny tasks in compute clusters. [32] B. Panda , J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: massively parallel
Liu, Tai-Wa; Ng, Gabriel Y F; Ng, Shamay S M
2018-03-07
The consequences of falls are devastating for patients with stroke. Balance problems and fear of falling are two major challenges, and recent systematic reviews have revealed that habitual physical exercise training alone cannot reduce the occurrence of falls in stroke survivors. However, recent trials with community-dwelling healthy older adults yielded the promising result that interventions with a cognitive behavioral therapy (CBT) component can simultaneously promote balance and reduce the fear of falling. Therefore, the aim of the proposed clinical trial is to evaluate the effectiveness of a combination of CBT and task-oriented balance training (TOBT) in promoting subjective balance confidence, and thereby reducing fear-avoidance behavior, improving balance ability, reducing fall risk, and promoting independent living, community reintegration, and health-related quality of life of patients with stroke. The study will constitute a placebo-controlled single-blind parallel-group randomized controlled trial in which patients are assessed immediately, at 3 months, and at 12 months. The selected participants will be randomly allocated into one of two parallel groups (the experimental group and the control group) with a 1:1 ratio. Both groups will receive 45 min of TOBT twice per week for 8 weeks. In addition, the experimental group will receive a 45-min CBT-based group intervention, and the control group will receive 45 min of general health education (GHE) twice per week for 8 weeks. The primary outcome measure is subjective balance confidence. The secondary outcome measures are fear-avoidance behavior, balance ability, fall risk, level of activities of daily living, community reintegration, and health-related quality of life. The proposed clinical trial will compare the effectiveness of CBT combined with TOBT and GHE combined with TOBT in promoting subjective balance confidence among chronic stroke patients. We hope our results will provide evidence of a safe
Scalable Parallelization of Skyline Computation for Multi-core Processors
DEFF Research Database (Denmark)
Chester, Sean; Sidlauskas, Darius; Assent, Ira
2015-01-01
The skyline is an important query operator for multi-criteria decision making. It reduces a dataset to only those points that offer optimal trade-offs of dimensions. In general, it is very expensive to compute. Recently, multi-core CPU algorithms have been proposed to accelerate the computation...... of the skyline. However, they do not sufficiently minimize dominance tests and so are not competitive with state-of-the-art sequential algorithms. In this paper, we introduce a novel multi-core skyline algorithm, Hybrid, which processes points in blocks. It maintains a shared, global skyline among all threads...
Parallel and Scalable Sparse Basic Linear Algebra Subprograms
DEFF Research Database (Denmark)
Liu, Weifeng
and heterogeneous processors. The thesis compares the proposed methods with state-of-the-art approaches on six homogeneous and five heterogeneous processors from Intel, AMD and nVidia. Using in total 38 sparse matrices as a benchmark suite, the experimental results show that the proposed methods obtain significant...
Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures
National Research Council Canada - National Science Library
Grama, Ananth; Manguoglu, Murat; Koyuturk, Mehmet; Naumov, Maxim; Sameh, Ahmed
2008-01-01
.... The study was motivated primarily by the lack of robustness of Krylov subspace iterative schemes with generic, black-box, pre-conditioners such as approximate (or incomplete) LU-factorizations...
Long-range interactions and parallel scalability in molecular simulations
Patra, M.; Hyvönen, M.T.; Falck, E.; Sabouri-Ghomi, M.; Vattulainen, I.; Karttunen, M.E.J.
2007-01-01
Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform
Parallelization of quantum molecular dynamics simulation code
International Nuclear Information System (INIS)
Kato, Kaori; Kunugi, Tomoaki; Shibahara, Masahiko; Kotake, Susumu
1998-02-01
A quantum molecular dynamics simulation code has been developed for the analysis of the thermalization of photon energies in the molecule or materials in Kansai Research Establishment. The simulation code is parallelized for both Scalar massively parallel computer (Intel Paragon XP/S75) and Vector parallel computer (Fujitsu VPP300/12). Scalable speed-up has been obtained with a distribution to processor units by division of particle group in both parallel computers. As a result of distribution to processor units not only by particle group but also by the particles calculation that is constructed with fine calculations, highly parallelization performance is achieved in Intel Paragon XP/S75. (author)
PSHED: a simplified approach to developing parallel programs
International Nuclear Information System (INIS)
Mahajan, S.M.; Ramesh, K.; Rajesh, K.; Somani, A.; Goel, M.
1992-01-01
This paper presents a simplified approach in the forms of a tree structured computational model for parallel application programs. An attempt is made to provide a standard user interface to execute programs on BARC Parallel Processing System (BPPS), a scalable distributed memory multiprocessor. The interface package called PSHED provides a basic framework for representing and executing parallel programs on different parallel architectures. The PSHED package incorporates concepts from a broad range of previous research in programming environments and parallel computations. (author). 6 refs
A parallel 2-opt algorithm for the traveling salesman problem
Verhoeven, M.G.A.; Aarts, E.H.L.; Swinkels, P.C.J.
1995-01-01
We present a scalable parallel local search algorithm based on data parallelism. The concept of distributed neighborhood structures is introduced, and applied to the Traveling Salesman Problem (TSP). Our parallel local search algorithm finds the same quality solutions as the classical 2-opt
Scalable fast multipole accelerated vortex methods
Hu, Qi
2014-05-01
The fast multipole method (FMM) is often used to accelerate the calculation of particle interactions in particle-based methods to simulate incompressible flows. To evaluate the most time-consuming kernels - the Biot-Savart equation and stretching term of the vorticity equation, we mathematically reformulated it so that only two Laplace scalar potentials are used instead of six. This automatically ensuring divergence-free far-field computation. Based on this formulation, we developed a new FMM-based vortex method on heterogeneous architectures, which distributed the work between multicore CPUs and GPUs to best utilize the hardware resources and achieve excellent scalability. The algorithm uses new data structures which can dynamically manage inter-node communication and load balance efficiently, with only a small parallel construction overhead. This algorithm can scale to large-sized clusters showing both strong and weak scalability. Careful error and timing trade-off analysis are also performed for the cutoff functions induced by the vortex particle method. Our implementation can perform one time step of the velocity+stretching calculation for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.
Implementation of a high performance parallel finite element micromagnetics package
International Nuclear Information System (INIS)
Scholz, W.; Suess, D.; Dittrich, R.; Schrefl, T.; Tsiantos, V.; Forster, H.; Fidler, J.
2004-01-01
A new high performance scalable parallel finite element micromagnetics package has been implemented. It includes solvers for static energy minimization, time integration of the Landau-Lifshitz-Gilbert equation, and the nudged elastic band method
iSIGHT-FD scalability test report.
Energy Technology Data Exchange (ETDEWEB)
Clay, Robert L.; Shneider, Max S.
2008-07-01
The engineering analysis community at Sandia National Laboratories uses a number of internal and commercial software codes and tools, including mesh generators, preprocessors, mesh manipulators, simulation codes, post-processors, and visualization packages. We define an analysis workflow as the execution of an ordered, logical sequence of these tools. Various forms of analysis (and in particular, methodologies that use multiple function evaluations or samples) involve executing parameterized variations of these workflows. As part of the DART project, we are evaluating various commercial workflow management systems, including iSIGHT-FD from Engineous. This report documents the results of a scalability test that was driven by DAKOTA and conducted on a parallel computer (Thunderbird). The purpose of this experiment was to examine the suitability and performance of iSIGHT-FD for large-scale, parameterized analysis workflows. As the results indicate, we found iSIGHT-FD to be suitable for this type of application.
Directory of Open Access Journals (Sweden)
Sarah Prasasti
2001-01-01
Full Text Available The method presented here provides the basis for a course in American prose for EFL students. Understanding and appreciation of American prose is a difficult task for the students because they come into contact with works that are full of cultural baggage and far apart from their own world. The audio visual aid is one of the alternatives to sensitize the students to the topic and the cultural background. Instead of proving the ready-made audio visual aids, teachers can involve students to actively engage in a more task oriented audiovisual project. Here, the teachers encourage their students to create their own audio visual aids using colors, pictures, sound and gestures as a point of initiation for further discussion. The students can use color that has become a strong element of fiction to help them calling up a forceful visual representation. Pictures can also stimulate the students to build their mental image. Sound and silence, which are a part of the fabric of literature, may also help them to increase the emotional impact.
Myria: Scalable Analytics as a Service
Howe, B.; Halperin, D.; Whitaker, A.
2014-12-01
At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from irrelevant effort technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.
Scalable cloud without dedicated storage
Batkovich, D. V.; Kompaniets, M. V.; Zarochentsev, A. K.
2015-05-01
We present a prototype of a scalable computing cloud. It is intended to be deployed on the basis of a cluster without the separate dedicated storage. The dedicated storage is replaced by the distributed software storage. In addition, all cluster nodes are used both as computing nodes and as storage nodes. This solution increases utilization of the cluster resources as well as improves fault tolerance and performance of the distributed storage. Another advantage of this solution is high scalability with a relatively low initial and maintenance cost. The solution is built on the basis of the open source components like OpenStack, CEPH, etc.
The STAPL Parallel Graph Library
Harshvardhan,
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable distributed graph container and a collection of commonly used parallel graph algorithms. The library introduces pGraph pViews that separate algorithm design from the container implementation. It supports three graph processing algorithmic paradigms, level-synchronous, asynchronous and coarse-grained, and provides common graph algorithms based on them. Experimental results demonstrate improved scalability in performance and data size over existing graph libraries on more than 16,000 cores and on internet-scale graphs containing over 16 billion vertices and 250 billion edges. © Springer-Verlag Berlin Heidelberg 2013.
Parallel Algorithms for the Exascale Era
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Laboratory
2016-10-19
New parallel algorithms are needed to reach the Exascale level of parallelism with millions of cores. We look at some of the research developed by students in projects at LANL. The research blends ideas from the early days of computing while weaving in the fresh approach brought by students new to the field of high performance computing. We look at reproducibility of global sums and why it is important to parallel computing. Next we look at how the concept of hashing has led to the development of more scalable algorithms suitable for next-generation parallel computers. Nearly all of this work has been done by undergraduates and published in leading scientific journals.
Asynchronous Checkpoint Migration with MRNet in the Scalable Checkpoint / Restart Library
Energy Technology Data Exchange (ETDEWEB)
Mohror, K; Moody, A; de Supinski, B R
2012-03-20
Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library (SCR). We employ MRNet, a tree-based overlay network library, to transfer checkpoints from the compute nodes to the parallel file system asynchronously. This enhancement increases application efficiency by removing the need for an application to block while checkpoints are transferred to the parallel file system. We show that the integration of SCR with MRNet can reduce the time spent in I/O operations by as much as 15x. However, our experiments exposed new scalability issues with our initial implementation. We discuss the sources of the scalability problems and our plans to address them.
Freeman, Bryan
2013-01-01
This book contains practical recipes on everything you will need to create task-based parallel programs using C#, .NET 4.5, and Visual Studio. The book is packed with illustrated code examples to create scalable programs.This book is intended to help experienced C# developers write applications that leverage the power of modern multicore processors. It provides the necessary knowledge for an experienced C# developer to work with .NET parallelism APIs. Previous experience of writing multithreaded applications is not necessary.
1982-01-01
Parallel Computations focuses on parallel computation, with emphasis on algorithms used in a variety of numerical and physical applications and for many different types of parallel computers. Topics covered range from vectorization of fast Fourier transforms (FFTs) and of the incomplete Cholesky conjugate gradient (ICCG) algorithm on the Cray-1 to calculation of table lookups and piecewise functions. Single tridiagonal linear systems and vectorized computation of reactive flow are also discussed.Comprised of 13 chapters, this volume begins by classifying parallel computers and describing techn
Optimizing Nanoelectrode Arrays for Scalable Intracellular Electrophysiology.
Abbott, Jeffrey; Ye, Tianyang; Ham, Donhee; Park, Hongkun
2018-03-20
, clarifying how the nanoelectrode attains intracellular access. This understanding will be translated into a circuit model for the nanobio interface, which we will then use to lay out the strategies for improving the interface. The intracellular interface of the nanoelectrode is currently inferior to that of the patch clamp electrode; reaching this benchmark will be an exciting challenge that involves optimization of electrode geometries, materials, chemical modifications, electroporation protocols, and recording/stimulation electronics, as we describe in the Account. Another important theme of this Account, beyond the optimization of the individual nanoelectrode-cell interface, is the scalability of the nanoscale electrodes. We will discuss this theme using a recent development from our groups as an example, where an array of ca. 1000 nanoelectrode pixels fabricated on a CMOS integrated circuit chip performs parallel intracellular recording from a few hundreds of cardiomyocytes, which marks a new milestone in electrophysiology.
Towards a streaming model for nested data parallelism
DEFF Research Database (Denmark)
Madsen, Frederik Meisner; Filinski, Andrzej
2013-01-01
The language-integrated cost semantics for nested data parallelism pioneered by NESL provides an intuitive, high-level model for predicting performance and scalability of parallel algorithms with reasonable accuracy. However, this predictability, obtained through a uniform, parallelism-flattening......The language-integrated cost semantics for nested data parallelism pioneered by NESL provides an intuitive, high-level model for predicting performance and scalability of parallel algorithms with reasonable accuracy. However, this predictability, obtained through a uniform, parallelism......-processable in a streaming fashion. This semantics is directly compatible with previously proposed piecewise execution models for nested data parallelism, but allows the expected space usage to be reasoned about directly at the source-language level. The language definition and implementation are still very much work...
Jonsdottir, Johanna; Thorsen, Rune; Aprile, Irene; Galeri, Silvia; Spannocchi, Giovanna; Beghi, Ettore; Bianchi, Elisa; Montesano, Angelo; Ferrarin, Maurizio
2017-01-01
Motor recovery of persons after stroke may be enhanced by a novel approach where residual muscle activity is facilitated by patient-controlled electrical muscle activation. Myoelectric activity from hemiparetic muscles is then used for continuous control of functional electrical stimulation (MeCFES) of same or synergic muscles to promote restoration of movements during task-oriented therapy (TOT). Use of MeCFES during TOT may help to obtain a larger functional and neurological recovery than otherwise possible. Multicenter randomized controlled trial. Eighty two acute and chronic stroke victims were recruited through the collaborating facilities and after signing an informed consent were randomized to receive either the experimental (MeCFES assisted TOT (M-TOT) or conventional rehabilitation care including TOT (C-TOT). Both groups received 45 minutes of rehabilitation over 25 sessions. Outcomes were Action Research Arm Test (ARAT), Upper Extremity Fugl-Meyer Assessment (FMA-UE) scores and Disability of the Arm Shoulder and Hand questionnaire. Sixty eight subjects completed the protocol (Mean age 66.2, range 36.5-88.7, onset months 12.7, range 0.8-19.1) of which 45 were seen at follow up 5 weeks later. There were significant improvements in both groups on ARAT (median improvement: MeCFES TOT group 3.0; C-TOT group 2.0) and FMA-UE (median improvement: M-TOT 4.5; C-TOT 3.5). Considering subacute subjects (time since stroke rehabilitation (57.9%) than in the C-TOT group (33.2%) (difference in proportion improved 24.7%; 95% CI -4.0; 48.6), though the study did not meet the planned sample size. This is the first large multicentre RCT to compare MeCFES assisted TOT with conventional care TOT for the upper extremity. No adverse events or negative outcomes were encountered, thus we conclude that MeCFES can be a safe adjunct to rehabilitation that could promote recovery of upper limb function in persons after stroke, particularly when applied in the subacute phase.
Building a parallel file system simulator
International Nuclear Information System (INIS)
Molina-Estolano, E; Maltzahn, C; Brandt, S A; Bent, J
2009-01-01
Parallel file systems are gaining in popularity in high-end computing centers as well as commercial data centers. High-end computing systems are expected to scale exponentially and to pose new challenges to their storage scalability in terms of cost and power. To address these challenges scientists and file system designers will need a thorough understanding of the design space of parallel file systems. Yet there exist few systematic studies of parallel file system behavior at petabyte- and exabyte scale. An important reason is the significant cost of getting access to large-scale hardware to test parallel file systems. To contribute to this understanding we are building a parallel file system simulator that can simulate parallel file systems at very large scale. Our goal is to simulate petabyte-scale parallel file systems on a small cluster or even a single machine in reasonable time and fidelity. With this simulator, file system experts will be able to tune existing file systems for specific workloads, scientists and file system deployment engineers will be able to better communicate workload requirements, file system designers and researchers will be able to try out design alternatives and innovations at scale, and instructors will be able to study very large-scale parallel file system behavior in the class room. In this paper we describe our approach and provide preliminary results that are encouraging both in terms of fidelity and simulation scalability.
Casanova, Henri; Robert, Yves
2008-01-01
""…The authors of the present book, who have extensive credentials in both research and instruction in the area of parallelism, present a sound, principled treatment of parallel algorithms. … This book is very well written and extremely well designed from an instructional point of view. … The authors have created an instructive and fascinating text. The book will serve researchers as well as instructors who need a solid, readable text for a course on parallelism in computing. Indeed, for anyone who wants an understandable text from which to acquire a current, rigorous, and broad vi
Zhebel, E.; Minisini, S.; Kononov, A.; Mulder, W.A.
2013-01-01
With the rapid developments in parallel compute architectures, algorithms for seismic modeling and imaging need to be reconsidered in terms of parallelization. The aim of this paper is to compare scalability of seismic modeling algorithms: finite differences, continuous mass-lumped finite elements
A Massively Parallel Code for Polarization Calculations
Akiyama, Shizuka; Höflich, Peter
2001-03-01
We present an implementation of our Monte-Carlo radiation transport method for rapidly expanding, NLTE atmospheres for massively parallel computers which utilizes both the distributed and shared memory models. This allows us to take full advantage of the fast communication and low latency inherent to nodes with multiple CPUs, and to stretch the limits of scalability with the number of nodes compared to a version which is based on the shared memory model. Test calculations on a local 20-node Beowulf cluster with dual CPUs showed an improved scalability by about 40%.
Scalable inference for stochastic block models
Peng, Chengbin
2017-12-08
Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of "big data," traditional inference algorithms for such a model are increasingly limited due to their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach to recover the latent parameters of the stochastic block model, in time linear with respect to the number of edges. We also propose a parallel algorithm based on message passing. Our algorithm can overlap communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, to process a real-world graph with about 1.3 million nodes and 10 million edges, our algorithm requires about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm can produce high quality results on both benchmark and real-world graphs. An example of finding more meaningful communities is illustrated consequently in comparison with a popular modularity maximization algorithm.
Percolator: Scalable Pattern Discovery in Dynamic Graphs
Energy Technology Data Exchange (ETDEWEB)
Choudhury, Sutanay; Purohit, Sumit; Lin, Peng; Wu, Yinghui; Holder, Lawrence B.; Agarwal, Khushbu
2018-02-06
We demonstrate Percolator, a distributed system for graph pattern discovery in dynamic graphs. In contrast to conventional mining systems, Percolator advocates efficient pattern mining schemes that (1) support pattern detection with keywords; (2) integrate incremental and parallel pattern mining; and (3) support analytical queries such as trend analysis. The core idea of Percolator is to dynamically decide and verify a small fraction of patterns and their in- stances that must be inspected in response to buffered updates in dynamic graphs, with a total mining cost independent of graph size. We demonstrate a) the feasibility of incremental pattern mining by walking through each component of Percolator, b) the efficiency and scalability of Percolator over the sheer size of real-world dynamic graphs, and c) how the user-friendly GUI of Percolator inter- acts with users to support keyword-based queries that detect, browse and inspect trending patterns. We also demonstrate two user cases of Percolator, in social media trend analysis and academic collaboration analysis, respectively.
Scalable Techniques for Formal Verification
Ray, Sandip
2010-01-01
This book presents state-of-the-art approaches to formal verification techniques to seamlessly integrate different formal verification methods within a single logical foundation. It should benefit researchers and practitioners looking to get a broad overview of the spectrum of formal verification techniques, as well as approaches to combining such techniques within a single framework. Coverage includes a range of case studies showing how such combination is fruitful in developing a scalable verification methodology for industrial designs. This book outlines both theoretical and practical issue
Developing Scalable Information Security Systems
Directory of Open Access Journals (Sweden)
Valery Konstantinovich Ablekov
2013-06-01
Full Text Available Existing physical security systems has wide range of lacks, including: high cost, a large number of vulnerabilities, problems of modification and support system. This paper covers an actual problem of developing systems without this list of drawbacks. The paper presents the architecture of the information security system, which operates through the network protocol TCP/IP, including the ability to connect different types of devices and integration with existing security systems. The main advantage is a significant increase in system reliability, scalability, both vertically and horizontally, with minimal cost of both financial and time resources.
EvAg: A Scalable Peer-to-Peer Evolutionary Algorithm
Laredo, J.L.J.; Eiben, A.E.; van Steen, M.R.; Merelo, J.J.
2010-01-01
This paper studies the scalability of an Evolutionary Algorithm (EA) whose population is structured by means of a gossiping protocol and where the evolutionary operators act exclusively within the local neighborhoods. This makes the algorithm inherently suited for parallel execution in a
SWAP-Assembler: scalable and efficient genome assembly towards thousands of cores.
Meng, Jintao; Wang, Bingqiang; Wei, Yanjie; Feng, Shengzhong; Balaji, Pavan
2014-01-01
There is a widening gap between the throughput of massive parallel sequencing machines and the ability to analyze these sequencing data. Traditional assembly methods requiring long execution time and large amount of memory on a single workstation limit their use on these massive data. This paper presents a highly scalable assembler named as SWAP-Assembler for processing massive sequencing data using thousands of cores, where SWAP is an acronym for Small World Asynchronous Parallel model. In the paper, a mathematical description of multi-step bi-directed graph (MSG) is provided to resolve the computational interdependence on merging edges, and a highly scalable computational framework for SWAP is developed to automatically preform the parallel computation of all operations. Graph cleaning and contig extension are also included for generating contigs with high quality. Experimental results show that SWAP-Assembler scales up to 2048 cores on Yanhuang dataset using only 26 minutes, which is better than several other parallel assemblers, such as ABySS, Ray, and PASHA. Results also show that SWAP-Assembler can generate high quality contigs with good N50 size and low error rate, especially it generated the longest N50 contig sizes for Fish and Yanhuang datasets. In this paper, we presented a highly scalable and efficient genome assembly software, SWAP-Assembler. Compared with several other assemblers, it showed very good performance in terms of scalability and contig quality. This software is available at: https://sourceforge.net/projects/swapassembler.
National Research Council Canada - National Science Library
Edge, Harris
1999-01-01
...), computational fluid dynamics (CFD) 6 project. Under the project, a proven zonal Navier-Stokes solver was rewritten for scalable parallel performance on both shared memory and distributed memory high performance computers...
A parallel ILP algorithm that incorporates incremental batch learning
Nuno Fonseca; Rui Camacho; Fernado Silva
2003-01-01
In this paper we tackle the problems of eciency and scala-bility faced by Inductive Logic Programming (ILP) systems. We proposethe use of parallelism to improve eciency and the use of an incrementalbatch learning to address the scalability problem. We describe a novelparallel algorithm that incorporates into ILP the method of incremen-tal batch learning. The theoretical complexity of the algorithm indicatesthat a linear speedup can be achieved.
Event metadata records as a testbed for scalable data mining
International Nuclear Information System (INIS)
Gemmeren, P van; Malon, D
2010-01-01
At a data rate of 200 hertz, event metadata records ('TAGs,' in ATLAS parlance) provide fertile grounds for development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise 'data mining,' but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation.
Parallelization of 2-D lattice Boltzmann codes
International Nuclear Information System (INIS)
Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo.
1996-03-01
Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author)
Parallelization of 2-D lattice Boltzmann codes
Energy Technology Data Exchange (ETDEWEB)
Suzuki, Soichiro; Kaburaki, Hideo; Yokokawa, Mitsuo
1996-03-01
Lattice Boltzmann (LB) codes to simulate two dimensional fluid flow are developed on vector parallel computer Fujitsu VPP500 and scalar parallel computer Intel Paragon XP/S. While a 2-D domain decomposition method is used for the scalar parallel LB code, a 1-D domain decomposition method is used for the vector parallel LB code to be vectorized along with the axis perpendicular to the direction of the decomposition. High parallel efficiency of 95.1% by the vector parallel calculation on 16 processors with 1152x1152 grid and 88.6% by the scalar parallel calculation on 100 processors with 800x800 grid are obtained. The performance models are developed to analyze the performance of the LB codes. It is shown by our performance models that the execution speed of the vector parallel code is about one hundred times faster than that of the scalar parallel code with the same number of processors up to 100 processors. We also analyze the scalability in keeping the available memory size of one processor element at maximum. Our performance model predicts that the execution time of the vector parallel code increases about 3% on 500 processors. Although the 1-D domain decomposition method has in general a drawback in the interprocessor communication, the vector parallel LB code is still suitable for the large scale and/or high resolution simulations. (author).
Design strategies for irregularly adapting parallel applications
International Nuclear Information System (INIS)
Oliker, Leonid; Biswas, Rupak; Shan, Hongzhang; Sing, Jaswinder Pal
2000-01-01
Achieving scalable performance for dynamic irregular applications is eminently challenging. Traditional message-passing approaches have been making steady progress towards this goal; however, they suffer from complex implementation requirements. The use of a global address space greatly simplifies the programming task, but can degrade the performance of dynamically adapting computations. In this work, we examine two major classes of adaptive applications, under five competing programming methodologies and four leading parallel architectures. Results indicate that it is possible to achieve message-passing performance using shared-memory programming techniques by carefully following the same high level strategies. Adaptive applications have computational work loads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations of such adaptive applications are therefore a challenging task. This work examines the implementation of two typical adaptive applications, Dynamic Remeshing and N-Body, across various programming paradigms and architectural platforms. We compare several critical factors of the parallel code development, including performance, programmability, scalability, algorithmic development, and portability
Space Situational Awareness Data Processing Scalability Utilizing Google Cloud Services
Greenly, D.; Duncan, M.; Wysack, J.; Flores, F.
Space Situational Awareness (SSA) is a fundamental and critical component of current space operations. The term SSA encompasses the awareness, understanding and predictability of all objects in space. As the population of orbital space objects and debris increases, the number of collision avoidance maneuvers grows and prompts the need for accurate and timely process measures. The SSA mission continually evolves to near real-time assessment and analysis demanding the need for higher processing capabilities. By conventional methods, meeting these demands requires the integration of new hardware to keep pace with the growing complexity of maneuver planning algorithms. SpaceNav has implemented a highly scalable architecture that will track satellites and debris by utilizing powerful virtual machines on the Google Cloud Platform. SpaceNav algorithms for processing CDMs outpace conventional means. A robust processing environment for tracking data, collision avoidance maneuvers and various other aspects of SSA can be created and deleted on demand. Migrating SpaceNav tools and algorithms into the Google Cloud Platform will be discussed and the trials and tribulations involved. Information will be shared on how and why certain cloud products were used as well as integration techniques that were implemented. Key items to be presented are: 1.Scientific algorithms and SpaceNav tools integrated into a scalable architecture a) Maneuver Planning b) Parallel Processing c) Monte Carlo Simulations d) Optimization Algorithms e) SW Application Development/Integration into the Google Cloud Platform 2. Compute Engine Processing a) Application Engine Automated Processing b) Performance testing and Performance Scalability c) Cloud MySQL databases and Database Scalability d) Cloud Data Storage e) Redundancy and Availability
A Practical and Scalable Tool to Find Overlaps between Sequences
Directory of Open Access Journals (Sweden)
Maan Haj Rachid
2015-01-01
Full Text Available The evolution of the next generation sequencing technology increases the demand for efficient solutions, in terms of space and time, for several bioinformatics problems. This paper presents a practical and easy-to-implement solution for one of these problems, namely, the all-pairs suffix-prefix problem, using a compact prefix tree. The paper demonstrates an efficient construction of this time-efficient and space-economical tree data structure. The paper presents techniques for parallel implementations of the proposed solution. Experimental evaluation indicates superior results in terms of space and time over existing solutions. Results also show that the proposed technique is highly scalable in a parallel execution environment.
The Node Monitoring Component of a Scalable Systems Software Environment
Energy Technology Data Exchange (ETDEWEB)
Miller, Samuel James [Iowa State Univ., Ames, IA (United States)
2006-01-01
This research describes Fountain, a suite of programs used to monitor the resources of a cluster. A cluster is a collection of individual computers that are connected via a high speed communication network. They are traditionally used by users who desire more resources, such as processing power and memory, than any single computer can provide. A common drawback to effectively utilizing such a large-scale system is the management infrastructure, which often does not often scale well as the system grows. Large-scale parallel systems provide new research challenges in the area of systems software, the programs or tools that manage the system from boot-up to running a parallel job. The approach presented in this thesis utilizes a collection of separate components that communicate with each other to achieve a common goal. While systems software comprises a broad array of components, this thesis focuses on the design choices for a node monitoring component. We will describe Fountain, an implementation of the Scalable Systems Software (SSS) node monitor specification. It is targeted at aggregate node monitoring for clusters, focusing on both scalability and fault tolerance as its design goals. It leverages widely used technologies such as XML and HTTP to present an interface to other components in the SSS environment.
Performance-scalable volumetric data classification for online industrial inspection
Abraham, Aby J.; Sadki, Mustapha; Lea, R. M.
2002-03-01
Non-intrusive inspection and non-destructive testing of manufactured objects with complex internal structures typically requires the enhancement, analysis and visualization of high-resolution volumetric data. Given the increasing availability of fast 3D scanning technology (e.g. cone-beam CT), enabling on-line detection and accurate discrimination of components or sub-structures, the inherent complexity of classification algorithms inevitably leads to throughput bottlenecks. Indeed, whereas typical inspection throughput requirements range from 1 to 1000 volumes per hour, depending on density and resolution, current computational capability is one to two orders-of-magnitude less. Accordingly, speeding up classification algorithms requires both reduction of algorithm complexity and acceleration of computer performance. A shape-based classification algorithm, offering algorithm complexity reduction, by using ellipses as generic descriptors of solids-of-revolution, and supporting performance-scalability, by exploiting the inherent parallelism of volumetric data, is presented. A two-stage variant of the classical Hough transform is used for ellipse detection and correlation of the detected ellipses facilitates position-, scale- and orientation-invariant component classification. Performance-scalability is achieved cost-effectively by accelerating a PC host with one or more COTS (Commercial-Off-The-Shelf) PCI multiprocessor cards. Experimental results are reported to demonstrate the feasibility and cost-effectiveness of the data-parallel classification algorithm for on-line industrial inspection applications.
On eliminating synchronous communication in molecular simulations to improve scalability
Straatsma, T. P.; Chavarría-Miranda, Daniel G.
2013-12-01
Molecular dynamics simulation, as a complementary tool to experimentation, has become an important methodology for the understanding and design of molecular systems as it provides access to properties that are difficult, impossible or prohibitively expensive to obtain experimentally. Many of the available software packages have been parallelized to take advantage of modern massively concurrent processing resources. The challenge in achieving parallel efficiency is commonly attributed to the fact that molecular dynamics algorithms are communication intensive. This paper illustrates how an appropriately chosen data distribution and asynchronous one-sided communication approach can be used to effectively deal with the data movement within the Global Arrays/ARMCI programming model framework. A new put_notify capability is presented here, allowing the implementation of the molecular dynamics algorithm without any explicit global or local synchronization or global data reduction operations. In addition, this push-data model is shown to very effectively allow hiding data communication behind computation. Rather than data movement or explicit global reductions, the implicit synchronization of the algorithm becomes the primary challenge for scalability. Without any explicit synchronous operations, the scalability of molecular simulations is shown to depend only on the ability to evenly balance computational load.
Compiler Technology for Parallel Scientific Computation
Directory of Open Access Journals (Sweden)
Can Özturan
1994-01-01
Full Text Available There is a need for compiler technology that, given the source program, will generate efficient parallel codes for different architectures with minimal user involvement. Parallel computation is becoming indispensable in solving large-scale problems in science and engineering. Yet, the use of parallel computation is limited by the high costs of developing the needed software. To overcome this difficulty we advocate a comprehensive approach to the development of scalable architecture-independent software for scientific computation based on our experience with equational programming language (EPL. Our approach is based on a program decomposition, parallel code synthesis, and run-time support for parallel scientific computation. The program decomposition is guided by the source program annotations provided by the user. The synthesis of parallel code is based on configurations that describe the overall computation as a set of interacting components. Run-time support is provided by the compiler-generated code that redistributes computation and data during object program execution. The generated parallel code is optimized using techniques of data alignment, operator placement, wavefront determination, and memory optimization. In this article we discuss annotations, configurations, parallel code generation, and run-time support suitable for parallel programs written in the functional parallel programming language EPL and in Fortran.
Parallel multigrid smoothing: polynomial versus Gauss-Seidel
International Nuclear Information System (INIS)
Adams, Mark; Brezina, Marian; Hu, Jonathan; Tuminaro, Ray
2003-01-01
Gauss-Seidel is often the smoother of choice within multigrid applications. In the context of unstructured meshes, however, maintaining good parallel efficiency is difficult with multiplicative iterative methods such as Gauss-Seidel. This leads us to consider alternative smoothers. We discuss the computational advantages of polynomial smoothers within parallel multigrid algorithms for positive definite symmetric systems. Two particular polynomials are considered: Chebyshev and a multilevel specific polynomial. The advantages of polynomial smoothing over traditional smoothers such as Gauss-Seidel are illustrated on several applications: Poisson's equation, thin-body elasticity, and eddy current approximations to Maxwell's equations. While parallelizing the Gauss-Seidel method typically involves a compromise between a scalable convergence rate and maintaining high flop rates, polynomial smoothers achieve parallel scalable multigrid convergence rates without sacrificing flop rates. We show that, although parallel computers are the main motivation, polynomial smoothers are often surprisingly competitive with Gauss-Seidel smoothers on serial machines
Parallel multigrid smoothing: polynomial versus Gauss-Seidel
Adams, Mark; Brezina, Marian; Hu, Jonathan; Tuminaro, Ray
2003-07-01
Gauss-Seidel is often the smoother of choice within multigrid applications. In the context of unstructured meshes, however, maintaining good parallel efficiency is difficult with multiplicative iterative methods such as Gauss-Seidel. This leads us to consider alternative smoothers. We discuss the computational advantages of polynomial smoothers within parallel multigrid algorithms for positive definite symmetric systems. Two particular polynomials are considered: Chebyshev and a multilevel specific polynomial. The advantages of polynomial smoothing over traditional smoothers such as Gauss-Seidel are illustrated on several applications: Poisson's equation, thin-body elasticity, and eddy current approximations to Maxwell's equations. While parallelizing the Gauss-Seidel method typically involves a compromise between a scalable convergence rate and maintaining high flop rates, polynomial smoothers achieve parallel scalable multigrid convergence rates without sacrificing flop rates. We show that, although parallel computers are the main motivation, polynomial smoothers are often surprisingly competitive with Gauss-Seidel smoothers on serial machines.
Scalable domain decomposition solvers for stochastic PDEs in high performance computing
International Nuclear Information System (INIS)
Desai, Ajit; Pettit, Chris; Poirel, Dominique; Sarkar, Abhijit
2017-01-01
Stochastic spectral finite element models of practical engineering systems may involve solutions of linear systems or linearized systems for non-linear problems with billions of unknowns. For stochastic modeling, it is therefore essential to design robust, parallel and scalable algorithms that can efficiently utilize high-performance computing to tackle such large-scale systems. Domain decomposition based iterative solvers can handle such systems. And though these algorithms exhibit excellent scalabilities, significant algorithmic and implementational challenges exist to extend them to solve extreme-scale stochastic systems using emerging computing platforms. Intrusive polynomial chaos expansion based domain decomposition algorithms are extended here to concurrently handle high resolution in both spatial and stochastic domains using an in-house implementation. Sparse iterative solvers with efficient preconditioners are employed to solve the resulting global and subdomain level local systems through multi-level iterative solvers. We also use parallel sparse matrix–vector operations to reduce the floating-point operations and memory requirements. Numerical and parallel scalabilities of these algorithms are presented for the diffusion equation having spatially varying diffusion coefficient modeled by a non-Gaussian stochastic process. Scalability of the solvers with respect to the number of random variables is also investigated.
An Introduction to Parallelism, Concurrency and Acceleration (1/2)
CERN. Geneva
2016-01-01
Concurrency and parallelism are firm elements of any modern computing infrastructure, made even more prominent by the emergence of accelerators. These lectures offer an introduction to these important concepts. We will begin with a brief refresher of recent hardware offerings to modern-day programmers. We will then open the main discussion with an overview of the laws and practical aspects of scalability. Key parallelism data structures, patterns and algorithms will be shown. The main threats to scalability and mitigation strategies will be discussed in the context of real-life optimization problems.
International Nuclear Information System (INIS)
Jejcic, A.; Maillard, J.; Maurel, G.; Silva, J.; Wolff-Bacha, F.
1997-01-01
The work in the field of parallel processing has developed as research activities using several numerical Monte Carlo simulations related to basic or applied current problems of nuclear and particle physics. For the applications utilizing the GEANT code development or improvement works were done on parts simulating low energy physical phenomena like radiation, transport and interaction. The problem of actinide burning by means of accelerators was approached using a simulation with the GEANT code. A program of neutron tracking in the range of low energies up to the thermal region has been developed. It is coupled to the GEANT code and permits in a single pass the simulation of a hybrid reactor core receiving a proton burst. Other works in this field refers to simulations for nuclear medicine applications like, for instance, development of biological probes, evaluation and characterization of the gamma cameras (collimators, crystal thickness) as well as the method for dosimetric calculations. Particularly, these calculations are suited for a geometrical parallelization approach especially adapted to parallel machines of the TN310 type. Other works mentioned in the same field refer to simulation of the electron channelling in crystals and simulation of the beam-beam interaction effect in colliders. The GEANT code was also used to simulate the operation of germanium detectors designed for natural and artificial radioactivity monitoring of environment
Requirements for Scalable Access Control and Security Management Architectures
National Research Council Canada - National Science Library
Keromytis, Angelos D; Smith, Jonathan M
2005-01-01
Maximizing local autonomy has led to a scalable Internet. Scalability and the capacity for distributed control have unfortunately not extended well to resource access control policies and mechanisms...
Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
Energy Technology Data Exchange (ETDEWEB)
Mohror, Kathryn [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Moody, Adam [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Bronevetsky, Greg [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); de Supinski, Bronis R. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2014-09-01
High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. But, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run. Moreover, lightweight checkpoints can handle the most common failure modes, while more expensive checkpoints can handle severe failures. We designed a multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system. We present probabilistic Markov models of SCR's performance. We show that on future large-scale systems, SCR can lead to a gain in machine efficiency of up to 35 percent, and reduce the load on the parallel file system by a factor of two. In addition, we predict that checkpoint scavenging, or only writing checkpoints to the parallel file system on application termination, can reduce the load on the parallel file system by 20 × on today's systems and still maintain high application efficiency.
Adaptive format conversion for scalable video coding
Wan, Wade K.; Lim, Jae S.
2001-12-01
The enhancement layer in many scalable coding algorithms is composed of residual coding information. There is another type of information that can be transmitted instead of (or in addition to) residual coding. Since the encoder has access to the original sequence, it can utilize adaptive format conversion (AFC) to generate the enhancement layer and transmit the different format conversion methods as enhancement data. This paper investigates the use of adaptive format conversion information as enhancement data in scalable video coding. Experimental results are shown for a wide range of base layer qualities and enhancement bitrates to determine when AFC can improve video scalability. Since the parameters needed for AFC are small compared to residual coding, AFC can provide video scalability at low enhancement layer bitrates that are not possible with residual coding. In addition, AFC can also be used in addition to residual coding to improve video scalability at higher enhancement layer bitrates. Adaptive format conversion has not been studied in detail, but many scalable applications may benefit from it. An example of an application that AFC is well-suited for is the migration path for digital television where AFC can provide immediate video scalability as well as assist future migrations.
McCallum, Ethan
2011-01-01
It's tough to argue with R as a high-quality, cross-platform, open source statistical software product-unless you're in the business of crunching Big Data. This concise book introduces you to several strategies for using R to analyze large datasets. You'll learn the basics of Snow, Multicore, Parallel, and some Hadoop-related tools, including how to find them, how to use them, when they work well, and when they don't. With these packages, you can overcome R's single-threaded nature by spreading work across multiple CPUs, or offloading work to multiple machines to address R's memory barrier.
ALADDIN - enhancing applicability and scalability
International Nuclear Information System (INIS)
Roverso, Davide
2001-02-01
The ALADDIN project aims at the study and development of flexible, accurate, and reliable techniques and principles for computerised event classification and fault diagnosis for complex machinery and industrial processes. The main focus of the project is on advanced numerical techniques, such as wavelets, and empirical modelling with neural networks. This document reports on recent important advancements, which significantly widen the practical applicability of the developed principles, both in terms of flexibility of use, and in terms of scalability to large problem domains. In particular, two novel techniques are here described. The first, which we call Wavelet On- Line Pre-processing (WOLP), is aimed at extracting, on-line, relevant dynamic features from the process data streams. This technique allows a system a greater flexibility in detecting and processing transients at a range of different time scales. The second technique, which we call Autonomous Recursive Task Decomposition (ARTD), is aimed at tackling the problem of constructing a classifier able to discriminate among a large number of different event/fault classes, which is often the case when the application domain is a complex industrial process. ARTD also allows for incremental application development (i.e. the incremental addition of new classes to an existing classifier, without the need of retraining the entire system), and for simplified application maintenance. The description of these novel techniques is complemented by reports of quantitative experiments that show in practice the extent of these improvements. (Author)
Fast and scalable inequality joins
Khayyat, Zuhair
2016-09-07
Inequality joins, which is to join relations with inequality conditions, are used in various applications. Optimizing joins has been the subject of intensive research ranging from efficient join algorithms such as sort-merge join, to the use of efficient indices such as (Formula presented.)-tree, (Formula presented.)-tree and Bitmap. However, inequality joins have received little attention and queries containing such joins are notably very slow. In this paper, we introduce fast inequality join algorithms based on sorted arrays and space-efficient bit-arrays. We further introduce a simple method to estimate the selectivity of inequality joins which is then used to optimize multiple predicate queries and multi-way joins. Moreover, we study an incremental inequality join algorithm to handle scenarios where data keeps changing. We have implemented a centralized version of these algorithms on top of PostgreSQL, a distributed version on top of Spark SQL, and an existing data cleaning system, Nadeef. By comparing our algorithms against well-known optimization techniques for inequality joins, we show our solution is more scalable and several orders of magnitude faster. © 2016 Springer-Verlag Berlin Heidelberg
Impact of multiplexed reading scheme on nanocrossbar memristor memory's scalability
International Nuclear Information System (INIS)
Zhu Xuan; Tang Yu-Hua; Wu Jun-Jie; Yi Xun; Wu Chun-Qing
2014-01-01
Nanocrossbar is a potential memory architecture to integrate memristor to achieve large scale and high density memory. However, based on the currently widely-adopted parallel reading scheme, scalability of the nanocrossbar memory is limited, since the overhead of the reading circuits is in proportion with the size of the nanocrossbar component. In this paper, a multiplexed reading scheme is adopted as the foundation of the discussion. Through HSPICE simulation, we reanalyze scalability of the nanocrossbar memristor memory by investigating the impact of various circuit parameters on the output voltage swing as the memory scales to larger size. We find that multiplexed reading maintains sufficient noise margin in large size nanocrossbar memristor memory. In order to improve the scalability of the memory, memristors with nonlinear I—V characteristics and high LRS (low resistive state) resistance should be adopted. (interdisciplinary physics and related areas of science and technology)
Embedded High Performance Scalable Computing Systems
National Research Council Canada - National Science Library
Ngo, David
2003-01-01
The Embedded High Performance Scalable Computing Systems (EHPSCS) program is a cooperative agreement between Sanders, A Lockheed Martin Company and DARPA that ran for three years, from Apr 1995 - Apr 1998...
Directory of Open Access Journals (Sweden)
James G. Worner
2017-05-01
Full Text Available James Worner is an Australian-based writer and scholar currently pursuing a PhD at the University of Technology Sydney. His research seeks to expose masculinities lost in the shadow of Australia’s Anzac hegemony while exploring new opportunities for contemporary historiography. He is the recipient of the Doctoral Scholarship in Historical Consciousness at the university’s Australian Centre of Public History and will be hosted by the University of Bologna during 2017 on a doctoral research writing scholarship. ‘Parallel Lines’ is one of a collection of stories, The Shapes of Us, exploring liminal spaces of modern life: class, gender, sexuality, race, religion and education. It looks at lives, like lines, that do not meet but which travel in proximity, simultaneously attracted and repelled. James’ short stories have been published in various journals and anthologies.
Parallelization of a numerical simulation code for isotropic turbulence
International Nuclear Information System (INIS)
Sato, Shigeru; Yokokawa, Mitsuo; Watanabe, Tadashi; Kaburaki, Hideo.
1996-03-01
A parallel pseudospectral code which solves the three-dimensional Navier-Stokes equation by direct numerical simulation is developed and execution time, parallelization efficiency, load balance and scalability are evaluated. A vector parallel supercomputer, Fujitsu VPP500 with up to 16 processors is used for this calculation for Fourier modes up to 256x256x256 using 16 processors. Good scalability for number of processors is achieved when number of Fourier mode is fixed. For small Fourier modes, calculation time of the program is proportional to NlogN which is ideal complexity of calculation for 3D-FFT on vector parallel processors. It is found that the calculation performance decreases as the increase of the Fourier modes. (author)
Resource-aware complexity scalability for mobile MPEG encoding
Mietens, S.O.; With, de P.H.N.; Hentschel, C.; Panchanatan, S.; Vasudev, B.
2004-01-01
Complexity scalability attempts to scale the required resources of an algorithm with the chose quality settings, in order to broaden the application range. In this paper, we present complexity-scalable MPEG encoding of which the core processing modules are modified for scalability. Scalability is
Multiple Independent File Parallel I/O with HDF5
Energy Technology Data Exchange (ETDEWEB)
Miller, M. C.
2016-07-13
The HDF5 library has supported the I/O requirements of HPC codes at Lawrence Livermore National Labs (LLNL) since the late 90’s. In particular, HDF5 used in the Multiple Independent File (MIF) parallel I/O paradigm has supported LLNL code’s scalable I/O requirements and has recently been gainfully used at scales as large as O(10^{6}) parallel tasks.
Scalable Distributed Architectures for Information Retrieval
National Research Council Canada - National Science Library
Lu, Zhihong
1999-01-01
.... Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection...
Parallel alternating direction preconditioner for isogeometric simulations of explicit dynamics
Łoś, Marcin; Woźniak, Maciej; Paszyński, Maciej; Dalcin, Lisandro; Calo, Victor M.
2015-01-01
incorporated as a part of PETIGA an isogeometric framework [7] build on top of PETSc [8]. We show the scalability of the parallel algorithm on STAMPEDE linux cluster up to 10,000 processors, as well as the convergence rate of the PCG solver
A highly scalable peptide-based assay system for proteomics.
Directory of Open Access Journals (Sweden)
Igor A Kozlov
Full Text Available We report a scalable and cost-effective technology for generating and screening high-complexity customizable peptide sets. The peptides are made as peptide-cDNA fusions by in vitro transcription/translation from pools of DNA templates generated by microarray-based synthesis. This approach enables large custom sets of peptides to be designed in silico, manufactured cost-effectively in parallel, and assayed efficiently in a multiplexed fashion. The utility of our peptide-cDNA fusion pools was demonstrated in two activity-based assays designed to discover protease and kinase substrates. In the protease assay, cleaved peptide substrates were separated from uncleaved and identified by digital sequencing of their cognate cDNAs. We screened the 3,011 amino acid HCV proteome for susceptibility to cleavage by the HCV NS3/4A protease and identified all 3 known trans cleavage sites with high specificity. In the kinase assay, peptide substrates phosphorylated by tyrosine kinases were captured and identified by sequencing of their cDNAs. We screened a pool of 3,243 peptides against Abl kinase and showed that phosphorylation events detected were specific and consistent with the known substrate preferences of Abl kinase. Our approach is scalable and adaptable to other protein-based assays.
Scalable fast multipole methods for vortex element methods
Hu, Qi
2012-11-01
We use a particle-based method to simulate incompressible flows, where the Fast Multipole Method (FMM) is used to accelerate the calculation of particle interactions. The most time-consuming kernelsâ\\'the Biot-Savart equation and stretching term of the vorticity equationâ\\'are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while automatically ensuring divergence-free far-field computation. Based on this formulation, and on our previous work for a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows including turbulence on heterogeneous architectures, which distributes the work between multi-core CPUs and GPUs to best utilize the hardware resources and achieve excellent scalability. The algorithm also uses new data structures which can dynamically manage inter-node communication and load balance efficiently but with only a small parallel construction overhead. This algorithm can scale to large-sized clusters showing both strong and weak scalability. Careful error and timing trade-off analysis are also performed for the cutoff functions induced by the vortex particle method. Our implementation can perform one time step of the velocity+stretching for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s. © 2012 IEEE.
GPU-based Scalable Volumetric Reconstruction for Multi-view Stereo
Energy Technology Data Exchange (ETDEWEB)
Kim, H; Duchaineau, M; Max, N
2011-09-21
We present a new scalable volumetric reconstruction algorithm for multi-view stereo using a graphics processing unit (GPU). It is an effectively parallelized GPU algorithm that simultaneously uses a large number of GPU threads, each of which performs voxel carving, in order to integrate depth maps with images from multiple views. Each depth map, triangulated from pair-wise semi-dense correspondences, represents a view-dependent surface of the scene. This algorithm also provides scalability for large-scale scene reconstruction in a high resolution voxel grid by utilizing streaming and parallel computation. The output is a photo-realistic 3D scene model in a volumetric or point-based representation. We demonstrate the effectiveness and the speed of our algorithm with a synthetic scene and real urban/outdoor scenes. Our method can also be integrated with existing multi-view stereo algorithms such as PMVS2 to fill holes or gaps in textureless regions.
Parallel programming practical aspects, models and current limitations
Tarkov, Mikhail S
2014-01-01
Parallel programming is designed for the use of parallel computer systems for solving time-consuming problems that cannot be solved on a sequential computer in a reasonable time. These problems can be divided into two classes: 1. Processing large data arrays (including processing images and signals in real time)2. Simulation of complex physical processes and chemical reactions For each of these classes, prospective methods are designed for solving problems. For data processing, one of the most promising technologies is the use of artificial neural networks. Particles-in-cell method and cellular automata are very useful for simulation. Problems of scalability of parallel algorithms and the transfer of existing parallel programs to future parallel computers are very acute now. An important task is to optimize the use of the equipment (including the CPU cache) of parallel computers. Along with parallelizing information processing, it is essential to ensure the processing reliability by the relevant organization ...
Parallel processing of genomics data
Agapito, Giuseppe; Guzzi, Pietro Hiram; Cannataro, Mario
2016-10-01
The availability of high-throughput experimental platforms for the analysis of biological samples, such as mass spectrometry, microarrays and Next Generation Sequencing, have made possible to analyze a whole genome in a single experiment. Such platforms produce an enormous volume of data per single experiment, thus the analysis of this enormous flow of data poses several challenges in term of data storage, preprocessing, and analysis. To face those issues, efficient, possibly parallel, bioinformatics software needs to be used to preprocess and analyze data, for instance to highlight genetic variation associated with complex diseases. In this paper we present a parallel algorithm for the parallel preprocessing and statistical analysis of genomics data, able to face high dimension of data and resulting in good response time. The proposed system is able to find statistically significant biological markers able to discriminate classes of patients that respond to drugs in different ways. Experiments performed on real and synthetic genomic datasets show good speed-up and scalability.
Overview of the Scalable Coherent Interface, IEEE STD 1596 (SCI)
International Nuclear Information System (INIS)
Gustavson, D.B.; James, D.V.; Wiggers, H.A.
1992-10-01
The Scalable Coherent Interface standard defines a new generation of interconnection that spans the full range from supercomputer memory 'bus' to campus-wide network. SCI provides bus-like services and a shared-memory software model while using an underlying, packet protocol on many independent communication links. Initially these links are 1 GByte/s (wires) and 1 GBit/s (fiber), but the protocol scales well to future faster or lower-cost technologies. The interconnect may use switches, meshes, and rings. The SCI distributed-shared-memory model is simple and versatile, enabling for the first time a smooth integration of highly parallel multiprocessors, workstations, personal computers, I/O, networking and data acquisition
Simplifying Scalable Graph Processing with a Domain-Specific Language
Hong, Sungpack; Salihoglu, Semih; Widom, Jennifer; Olukotun, Kunle
2014-01-01
Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and scalability but inconvenient for implementing non-trivial graph algorithms. In this paper, we use Green-Marl, a Domain-Specific Language for graph analysis, to intuitively describe graph algorithms and extend its compiler to generate equivalent Pregel implementations. Using the semantic information captured by Green-Marl, the compiler applies a set of transformation rules that convert imperative graph algorithms into Pregel's programming model. Our experiments show that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms. The compiler is even able to generate a Pregel implementation of a complicated graph algorithm for which a manual Pregel implementation is very challenging.
Simplifying Scalable Graph Processing with a Domain-Specific Language
Hong, Sungpack
2014-01-01
Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and scalability but inconvenient for implementing non-trivial graph algorithms. In this paper, we use Green-Marl, a Domain-Specific Language for graph analysis, to intuitively describe graph algorithms and extend its compiler to generate equivalent Pregel implementations. Using the semantic information captured by Green-Marl, the compiler applies a set of transformation rules that convert imperative graph algorithms into Pregel\\'s programming model. Our experiments show that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms. The compiler is even able to generate a Pregel implementation of a complicated graph algorithm for which a manual Pregel implementation is very challenging.
Implementation of the Timepix ASIC in the Scalable Readout System
Energy Technology Data Exchange (ETDEWEB)
Lupberger, M., E-mail: lupberger@physik.uni-bonn.de; Desch, K.; Kaminski, J.
2016-09-11
We report on the development of electronics hardware, FPGA firmware and software to provide a flexible multi-chip readout of the Timepix ASIC within the framework of the Scalable Readout System (SRS). The system features FPGA-based zero-suppression and the possibility to read out up to 4×8 chips with a single Front End Concentrator (FEC). By operating several FECs in parallel, in principle an arbitrary number of chips can be read out, exploiting the scaling features of SRS. Specifically, we tested the system with a setup consisting of 160 Timepix ASICs, operated as GridPix devices in a large TPC field cage in a 1 T magnetic field at a DESY test beam facility providing an electron beam of up to 6 GeV. We discuss the design choices, the dedicated hardware components, the FPGA firmware as well as the performance of the system in the test beam.
fastBMA: scalable network inference and transitive reduction.
Hung, Ling-Hong; Shi, Kaiyuan; Wu, Migao; Young, William Chad; Raftery, Adrian E; Yeung, Ka Yee
2017-10-01
Inferring genetic networks from genome-wide expression data is extremely demanding computationally. We have developed fastBMA, a distributed, parallel, and scalable implementation of Bayesian model averaging (BMA) for this purpose. fastBMA also includes a computationally efficient module for eliminating redundant indirect edges in the network by mapping the transitive reduction to an easily solved shortest-path problem. We evaluated the performance of fastBMA on synthetic data and experimental genome-wide time series yeast and human datasets. When using a single CPU core, fastBMA is up to 100 times faster than the next fastest method, LASSO, with increased accuracy. It is a memory-efficient, parallel, and distributed application that scales to human genome-wide expression data. A 10 000-gene regulation network can be obtained in a matter of hours using a 32-core cloud cluster (2 nodes of 16 cores). fastBMA is a significant improvement over its predecessor ScanBMA. It is more accurate and orders of magnitude faster than other fast network inference methods such as the 1 based on LASSO. The improved scalability allows it to calculate networks from genome scale data in a reasonable time frame. The transitive reduction method can improve accuracy in denser networks. fastBMA is available as code (M.I.T. license) from GitHub (https://github.com/lhhunghimself/fastBMA), as part of the updated networkBMA Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/networkBMA.html) and as ready-to-deploy Docker images (https://hub.docker.com/r/biodepot/fastbma/). © The Authors 2017. Published by Oxford University Press.
A Numerical Study of Scalable Cardiac Electro-Mechanical Solvers on HPC Architectures
Directory of Open Access Journals (Sweden)
Piero Colli Franzone
2018-04-01
Full Text Available We introduce and study some scalable domain decomposition preconditioners for cardiac electro-mechanical 3D simulations on parallel HPC (High Performance Computing architectures. The electro-mechanical model of the cardiac tissue is composed of four coupled sub-models: (1 the static finite elasticity equations for the transversely isotropic deformation of the cardiac tissue; (2 the active tension model describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; (3 the anisotropic Bidomain model describing the evolution of the intra- and extra-cellular potentials in the deforming cardiac tissue; and (4 the ionic membrane model describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels. This strongly coupled electro-mechanical model is discretized in time with a splitting semi-implicit technique and in space with isoparametric finite elements. The resulting scalable parallel solver is based on Multilevel Additive Schwarz preconditioners for the solution of the Bidomain system and on BDDC preconditioned Newton-Krylov solvers for the non-linear finite elasticity system. The results of several 3D parallel simulations show the scalability of both linear and non-linear solvers and their application to the study of both physiological excitation-contraction cardiac dynamics and re-entrant waves in the presence of different mechano-electrical feedbacks.
Evaluating the scalability of HEP software and multi-core hardware
Jarp, S; Leduc, J; Nowak, A
2011-01-01
As researchers have reached the practical limits of processor performance improvements by frequency scaling, it is clear that the future of computing lies in the effective utilization of parallel and multi-core architectures. Since this significant change in computing is well underway, it is vital for HEP programmers to understand the scalability of their software on modern hardware and the opportunities for potential improvements. This work aims to quantify the benefit of new mainstream architectures to the HEP community through practical benchmarking on recent hardware solutions, including the usage of parallelized HEP applications.
Evolution of a minimal parallel programming model
International Nuclear Information System (INIS)
Lusk, Ewing; Butler, Ralph; Pieper, Steven C.
2017-01-01
Here, we take a historical approach to our presentation of self-scheduled task parallelism, a programming model with its origins in early irregular and nondeterministic computations encountered in automated theorem proving and logic programming. We show how an extremely simple task model has evolved into a system, asynchronous dynamic load balancing (ADLB), and a scalable implementation capable of supporting sophisticated applications on today’s (and tomorrow’s) largest supercomputers; and we illustrate the use of ADLB with a Green’s function Monte Carlo application, a modern, mature nuclear physics code in production use. Our lesson is that by surrendering a certain amount of generality and thus applicability, a minimal programming model (in terms of its basic concepts and the size of its application programmer interface) can achieve extreme scalability without introducing complexity.
International Nuclear Information System (INIS)
Zhang, B.; Li, G.; Wang, W.; Shangguan, D.; Deng, L.
2015-01-01
This paper introduces the Strategy of multilevel hybrid parallelism of JCOGIN Infrastructure on Monte Carlo Particle Transport for the large-scale full-core pin-by-pin simulations. The particle parallelism, domain decomposition parallelism and MPI/OpenMP parallelism are designed and implemented. By the testing, JMCT presents the parallel scalability of JCOGIN, which reaches the parallel efficiency 80% on 120,000 cores for the pin-by-pin computation of the BEAVRS benchmark. (author)
Parallel algorithms for 2-D cylindrical transport equations of Eigenvalue problem
International Nuclear Information System (INIS)
Wei, J.; Yang, S.
2013-01-01
In this paper, aimed at the neutron transport equations of eigenvalue problem under 2-D cylindrical geometry on unstructured grid, the discrete scheme of Sn discrete ordinate and discontinuous finite is built, and the parallel computation for the scheme is realized on MPI systems. Numerical experiments indicate that the designed parallel algorithm can reach perfect speedup, it has good practicality and scalability. (authors)
The Concept of Business Model Scalability
DEFF Research Database (Denmark)
Lund, Morten; Nielsen, Christian
2018-01-01
-term pro table business. However, the main message of this article is that while providing a good value proposition may help the rm ‘get by’, the really successful businesses of today are those able to reach the sweet-spot of business model scalability. Design/Methodology/Approach: The article is based...... on a ve-year longitudinal action research project of over 90 companies that participated in the International Center for Innovation project aimed at building 10 global network-based business models. Findings: This article introduces and discusses the term scalability from a company-level perspective......Purpose: The purpose of the article is to de ne what scalable business models are. Central to the contemporary understanding of business models is the value proposition towards the customer and the hypotheses generated about delivering value to the customer which become a good foundation for a long...
Declarative and Scalable Selection for Map Visualizations
DEFF Research Database (Denmark)
Kefaloukos, Pimin Konstantin Balic
and is itself a source and cause of prolific data creation. This calls for scalable map processing techniques that can handle the data volume and which play well with the predominant data models on the Web. (4) Maps are now consumed around the clock by a global audience. While historical maps were singleuser......-defined constraints as well as custom objectives. The purpose of the language is to derive a target multi-scale database from a source database according to holistic specifications. (b) The Glossy SQL compiler allows Glossy SQL to be scalably executed in a spatial analytics system, such as a spatial relational......, there are indications that the method is scalable for databases that contain millions of records, especially if the target language of the compiler is substituted by a cluster-ready variant of SQL. While several realistic use cases for maps have been implemented in CVL, additional non-geographic data visualization uses...
Scalable Density-Based Subspace Clustering
DEFF Research Database (Denmark)
Müller, Emmanuel; Assent, Ira; Günnemann, Stephan
2011-01-01
For knowledge discovery in high dimensional databases, subspace clustering detects clusters in arbitrary subspace projections. Scalability is a crucial issue, as the number of possible projections is exponential in the number of dimensions. We propose a scalable density-based subspace clustering...... method that steers mining to few selected subspace clusters. Our novel steering technique reduces subspace processing by identifying and clustering promising subspaces and their combinations directly. Thereby, it narrows down the search space while maintaining accuracy. Thorough experiments on real...... and synthetic databases show that steering is efficient and scalable, with high quality results. For future work, our steering paradigm for density-based subspace clustering opens research potential for speeding up other subspace clustering approaches as well....
Software performance and scalability a quantitative approach
Liu, Henry H
2009-01-01
Praise from the Reviewers:"The practicality of the subject in a real-world situation distinguishes this book from othersavailable on the market."—Professor Behrouz Far, University of Calgary"This book could replace the computer organization texts now in use that every CS and CpEstudent must take. . . . It is much needed, well written, and thoughtful."—Professor Larry Bernstein, Stevens Institute of TechnologyA distinctive, educational text onsoftware performance and scalabilityThis is the first book to take a quantitative approach to the subject of software performance and scalability
From Digital Disruption to Business Model Scalability
DEFF Research Database (Denmark)
Nielsen, Christian; Lund, Morten; Thomsen, Peter Poulsen
2017-01-01
This article discusses the terms disruption, digital disruption, business models and business model scalability. It illustrates how managers should be using these terms for the benefit of their business by developing business models capable of achieving exponentially increasing returns to scale...... will seldom lead to business model scalability capable of competing with digital disruption(s)....... as a response to digital disruption. A series of case studies illustrate that besides frequent existing messages in the business literature relating to the importance of creating agile businesses, both in growing and declining economies, as well as hard to copy value propositions or value propositions that take...
Content-Aware Scalability-Type Selection for Rate Adaptation of Scalable Video
Directory of Open Access Journals (Sweden)
Tekalp A Murat
2007-01-01
Full Text Available Scalable video coders provide different scaling options, such as temporal, spatial, and SNR scalabilities, where rate reduction by discarding enhancement layers of different scalability-type results in different kinds and/or levels of visual distortion depend on the content and bitrate. This dependency between scalability type, video content, and bitrate is not well investigated in the literature. To this effect, we first propose an objective function that quantifies flatness, blockiness, blurriness, and temporal jerkiness artifacts caused by rate reduction by spatial size, frame rate, and quantization parameter scaling. Next, the weights of this objective function are determined for different content (shot types and different bitrates using a training procedure with subjective evaluation. Finally, a method is proposed for choosing the best scaling type for each temporal segment that results in minimum visual distortion according to this objective function given the content type of temporal segments. Two subjective tests have been performed to validate the proposed procedure for content-aware selection of the best scalability type on soccer videos. Soccer videos scaled from 600 kbps to 100 kbps by the proposed content-aware selection of scalability type have been found visually superior to those that are scaled using a single scalability option over the whole sequence.
Using scalable vector graphics to evolve art
den Heijer, E.; Eiben, A. E.
2016-01-01
In this paper, we describe our investigations of the use of scalable vector graphics as a genotype representation in evolutionary art. We describe the technical aspects of using SVG in evolutionary art, and explain our custom, SVG specific operators initialisation, mutation and crossover. We perform
Scalable Open Source Smart Grid Simulator (SGSim)
DEFF Research Database (Denmark)
Ebeid, Emad Samuel Malki; Jacobsen, Rune Hylsberg; Stefanni, Francesco
2017-01-01
. This paper presents an open source smart grid simulator (SGSim). The simulator is based on open source SystemC Network Simulation Library (SCNSL) and aims to model scalable smart grid applications. SGSim has been tested under different smart grid scenarios that contain hundreds of thousands of households...
Cooperative Scalable Moving Continuous Query Processing
DEFF Research Database (Denmark)
Li, Xiaohui; Karras, Panagiotis; Jensen, Christian S.
2012-01-01
of the global view and handle the majority of the workload. Meanwhile, moving clients, having basic memory and computation resources, handle small portions of the workload. This model is further enhanced by dynamic region allocation and grid size adjustment mechanisms that reduce the communication...... and computation cost for both servers and clients. An experimental study demonstrates that our approaches offer better scalability than competitors...
Scalable optical switches for computing applications
White, I.H.; Aw, E.T.; Williams, K.A.; Wang, Haibo; Wonfor, A.; Penty, R.V.
2009-01-01
A scalable photonic interconnection network architecture is proposed whereby a Clos network is populated with broadcast-and-select stages. This enables the efficient exploitation of an emerging class of photonic integrated switch fabric. A low distortion space switch technology based on recently
SAChES: Scalable Adaptive Chain-Ensemble Sampling.
Energy Technology Data Exchange (ETDEWEB)
Swiler, Laura Painton [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Ray, Jaideep [Sandia National Lab. (SNL-CA), Livermore, CA (United States); Ebeida, Mohamed Salah [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Huang, Maoyi [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Hou, Zhangshuan [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Bao, Jie [Pacific Northwest National Lab. (PNNL), Richland, WA (United States); Ren, Huiying [Pacific Northwest National Lab. (PNNL), Richland, WA (United States)
2017-08-01
We present the development of a parallel Markov Chain Monte Carlo (MCMC) method called SAChES, Scalable Adaptive Chain-Ensemble Sampling. This capability is targed to Bayesian calibration of com- putationally expensive simulation models. SAChES involves a hybrid of two methods: Differential Evo- lution Monte Carlo followed by Adaptive Metropolis. Both methods involve parallel chains. Differential evolution allows one to explore high-dimensional parameter spaces using loosely coupled (i.e., largely asynchronous) chains. Loose coupling allows the use of large chain ensembles, with far more chains than the number of parameters to explore. This reduces per-chain sampling burden, enables high-dimensional inversions and the use of computationally expensive forward models. The large number of chains can also ameliorate the impact of silent-errors, which may affect only a few chains. The chain ensemble can also be sampled to provide an initial condition when an aberrant chain is re-spawned. Adaptive Metropolis takes the best points from the differential evolution and efficiently hones in on the poste- rior density. The multitude of chains in SAChES is leveraged to (1) enable efficient exploration of the parameter space; and (2) ensure robustness to silent errors which may be unavoidable in extreme-scale computational platforms of the future. This report outlines SAChES, describes four papers that are the result of the project, and discusses some additional results.
Task Oriented Tools for Information Retrieval
Yang, Peilin
2017-01-01
Information Retrieval (IR) is one of the most evolving research fields and has drawn extensive attention in recent years. Because of its empirical nature, the advance of the IR field is closely related to the development of various toolkits. While the traditional IR toolkit mainly provides a platform to evaluate the effectiveness of retrieval…
Analysis of scalability of high-performance 3D image processing platform for virtual colonoscopy.
Yoshida, Hiroyuki; Wu, Yin; Cai, Wenli
2014-03-19
One of the key challenges in three-dimensional (3D) medical imaging is to enable the fast turn-around time, which is often required for interactive or real-time response. This inevitably requires not only high computational power but also high memory bandwidth due to the massive amount of data that need to be processed. For this purpose, we previously developed a software platform for high-performance 3D medical image processing, called HPC 3D-MIP platform, which employs increasingly available and affordable commodity computing systems such as the multicore, cluster, and cloud computing systems. To achieve scalable high-performance computing, the platform employed size-adaptive, distributable block volumes as a core data structure for efficient parallelization of a wide range of 3D-MIP algorithms, supported task scheduling for efficient load distribution and balancing, and consisted of a layered parallel software libraries that allow image processing applications to share the common functionalities. We evaluated the performance of the HPC 3D-MIP platform by applying it to computationally intensive processes in virtual colonoscopy. Experimental results showed a 12-fold performance improvement on a workstation with 12-core CPUs over the original sequential implementation of the processes, indicating the efficiency of the platform. Analysis of performance scalability based on the Amdahl's law for symmetric multicore chips showed the potential of a high performance scalability of the HPC 3D-MIP platform when a larger number of cores is available.
Scalable Packet Classification with Hash Tables
Wang, Pi-Chung
In the last decade, the technique of packet classification has been widely deployed in various network devices, including routers, firewalls and network intrusion detection systems. In this work, we improve the performance of packet classification by using multiple hash tables. The existing hash-based algorithms have superior scalability with respect to the required space; however, their search performance may not be comparable to other algorithms. To improve the search performance, we propose a tuple reordering algorithm to minimize the number of accessed hash tables with the aid of bitmaps. We also use pre-computation to ensure the accuracy of our search procedure. Performance evaluation based on both real and synthetic filter databases shows that our scheme is effective and scalable and the pre-computation cost is moderate.
Scalable fabrication of perovskite solar cells
Energy Technology Data Exchange (ETDEWEB)
Li, Zhen; Klein, Talysa R.; Kim, Dong Hoe; Yang, Mengjin; Berry, Joseph J.; van Hest, Maikel F. A. M.; Zhu, Kai
2018-03-27
Perovskite materials use earth-abundant elements, have low formation energies for deposition and are compatible with roll-to-roll and other high-volume manufacturing techniques. These features make perovskite solar cells (PSCs) suitable for terawatt-scale energy production with low production costs and low capital expenditure. Demonstrations of performance comparable to that of other thin-film photovoltaics (PVs) and improvements in laboratory-scale cell stability have recently made scale up of this PV technology an intense area of research focus. Here, we review recent progress and challenges in scaling up PSCs and related efforts to enable the terawatt-scale manufacturing and deployment of this PV technology. We discuss common device and module architectures, scalable deposition methods and progress in the scalable deposition of perovskite and charge-transport layers. We also provide an overview of device and module stability, module-level characterization techniques and techno-economic analyses of perovskite PV modules.
Parallel ray tracing for one-dimensional discrete ordinate computations
International Nuclear Information System (INIS)
Jarvis, R.D.; Nelson, P.
1996-01-01
The ray-tracing sweep in discrete-ordinates, spatially discrete numerical approximation methods applied to the linear, steady-state, plane-parallel, mono-energetic, azimuthally symmetric, neutral-particle transport equation can be reduced to a parallel prefix computation. In so doing, the often severe penalty in convergence rate of the source iteration, suffered by most current parallel algorithms using spatial domain decomposition, can be avoided while attaining parallelism in the spatial domain to whatever extent desired. In addition, the reduction implies parallel algorithm complexity limits for the ray-tracing sweep. The reduction applies to all closed, linear, one-cell functional (CLOF) spatial approximation methods, which encompasses most in current popular use. Scalability test results of an implementation of the algorithm on a 64-node nCube-2S hypercube-connected, message-passing, multi-computer are described. (author)
SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution
International Nuclear Information System (INIS)
Beckman, P.; Crotinger, J.; Karmesin, S.; Malony, A.; Oldehoeft, R.; Shende, S.; Smith, S.; Vajracharya, S.
1999-01-01
In the solution of large-scale numerical prob- lems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely com- plex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of manag- ing parallelism and data locality from the user. We present innovative algorithms, based on the macro -dataflow model, for detecting data parallelism and efficiently executing data- parallel statements on shared-memory multiprocessors. We also desclibe how these algorithms can be implemented on clusters of SMPS
SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution
Energy Technology Data Exchange (ETDEWEB)
Beckman, P.; Crotinger, J.; Karmesin, S.; Malony, A.; Oldehoeft, R.; Shende, S.; Smith, S.; Vajracharya, S.
1999-01-04
In the solution of large-scale numerical prob- lems, parallel computing is becoming simultaneously more important and more difficult. The complex organization of today's multiprocessors with several memory hierarchies has forced the scientific programmer to make a choice between simple but unscalable code and scalable but extremely com- plex code that does not port to other architectures. This paper describes how the SMARTS runtime system and the POOMA C++ class library for high-performance scientific computing work together to exploit data parallelism in scientific applications while hiding the details of manag- ing parallelism and data locality from the user. We present innovative algorithms, based on the macro -dataflow model, for detecting data parallelism and efficiently executing data- parallel statements on shared-memory multiprocessors. We also desclibe how these algorithms can be implemented on clusters of SMPS.
Parallel science and engineering applications the Charm++ approach
Kale, Laxmikant V
2016-01-01
Developed in the context of science and engineering applications, with each abstraction motivated by and further honed by specific application needs, Charm++ is a production-quality system that runs on almost all parallel computers available. Parallel Science and Engineering Applications: The Charm++ Approach surveys a diverse and scalable collection of science and engineering applications, most of which are used regularly on supercomputers by scientists to further their research. After a brief introduction to Charm++, the book presents several parallel CSE codes written in the Charm++ model, along with their underlying scientific and numerical formulations, explaining their parallelization strategies and parallel performance. These chapters demonstrate the versatility of Charm++ and its utility for a wide variety of applications, including molecular dynamics, cosmology, quantum chemistry, fracture simulations, agent-based simulations, and weather modeling. The book is intended for a wide audience of people i...
Scalable manufacturing processes with soft materials
White, Edward; Case, Jennifer; Kramer, Rebecca
2014-01-01
The emerging field of soft robotics will benefit greatly from new scalable manufacturing techniques for responsive materials. Currently, most of soft robotic examples are fabricated one-at-a-time, using techniques borrowed from lithography and 3D printing to fabricate molds. This limits both the maximum and minimum size of robots that can be fabricated, and hinders batch production, which is critical to gain wider acceptance for soft robotic systems. We have identified electrical structures, ...
Architecture Knowledge for Evaluating Scalable Databases
2015-01-16
Architecture Knowledge for Evaluating Scalable Databases 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) Nurgaliev... Scala , Erlang, Javascript Cursor-based queries Supported, Not Supported JOIN queries Supported, Not Supported Complex data types Lists, maps, sets...is therefore needed, using technology such as machine learning to extract content from product documentation. The terminology used in the database
Randomized Algorithms for Scalable Machine Learning
Kleiner, Ariel Jacob
2012-01-01
Many existing procedures in machine learning and statistics are computationally intractable in the setting of large-scale data. As a result, the advent of rapidly increasing dataset sizes, which should be a boon yielding improved statistical performance, instead severely blunts the usefulness of a variety of existing inferential methods. In this work, we use randomness to ameliorate this lack of scalability by reducing complex, computationally difficult inferential problems to larger sets o...
Bitcoin-NG: A Scalable Blockchain Protocol
Eyal, Ittay; Gencer, Adem Efe; Sirer, Emin Gun; van Renesse, Robbert
2015-01-01
Cryptocurrencies, based on and led by Bitcoin, have shown promise as infrastructure for pseudonymous online payments, cheap remittance, trustless digital asset exchange, and smart contracts. However, Bitcoin-derived blockchain protocols have inherent scalability limits that trade-off between throughput and latency and withhold the realization of this potential. This paper presents Bitcoin-NG, a new blockchain protocol designed to scale. Based on Bitcoin's blockchain protocol, Bitcoin-NG is By...
Scuba: scalable kernel-based gene prioritization.
Zampieri, Guido; Tran, Dinh Van; Donini, Michele; Navarin, Nicolò; Aiolli, Fabio; Sperduti, Alessandro; Valle, Giorgio
2018-01-25
The uncovering of genes linked to human diseases is a pressing challenge in molecular biology and precision medicine. This task is often hindered by the large number of candidate genes and by the heterogeneity of the available information. Computational methods for the prioritization of candidate genes can help to cope with these problems. In particular, kernel-based methods are a powerful resource for the integration of heterogeneous biological knowledge, however, their practical implementation is often precluded by their limited scalability. We propose Scuba, a scalable kernel-based method for gene prioritization. It implements a novel multiple kernel learning approach, based on a semi-supervised perspective and on the optimization of the margin distribution. Scuba is optimized to cope with strongly unbalanced settings where known disease genes are few and large scale predictions are required. Importantly, it is able to efficiently deal both with a large amount of candidate genes and with an arbitrary number of data sources. As a direct consequence of scalability, Scuba integrates also a new efficient strategy to select optimal kernel parameters for each data source. We performed cross-validation experiments and simulated a realistic usage setting, showing that Scuba outperforms a wide range of state-of-the-art methods. Scuba achieves state-of-the-art performance and has enhanced scalability compared to existing kernel-based approaches for genomic data. This method can be useful to prioritize candidate genes, particularly when their number is large or when input data is highly heterogeneous. The code is freely available at https://github.com/gzampieri/Scuba .
Parallel Programming with Intel Parallel Studio XE
Blair-Chappell , Stephen
2012-01-01
Optimize code for multi-core processors with Intel's Parallel Studio Parallel programming is rapidly becoming a "must-know" skill for developers. Yet, where to start? This teach-yourself tutorial is an ideal starting point for developers who already know Windows C and C++ and are eager to add parallelism to their code. With a focus on applying tools, techniques, and language extensions to implement parallelism, this essential resource teaches you how to write programs for multicore and leverage the power of multicore in your programs. Sharing hands-on case studies and real-world examples, the
DISP: Optimizations towards Scalable MPI Startup
Energy Technology Data Exchange (ETDEWEB)
Fu, Huansong [Florida State University, Tallahassee; Pophale, Swaroop S [ORNL; Gorentla Venkata, Manjunath [ORNL; Yu, Weikuan [Florida State University, Tallahassee
2016-01-01
Despite the popularity of MPI for high performance computing, the startup of MPI programs faces a scalability challenge as both the execution time and memory consumption increase drastically at scale. We have examined this problem using the collective modules of Cheetah and Tuned in Open MPI as representative implementations. Previous improvements for collectives have focused on algorithmic advances and hardware off-load. In this paper, we examine the startup cost of the collective module within a communicator and explore various techniques to improve its efficiency and scalability. Accordingly, we have developed a new scalable startup scheme with three internal techniques, namely Delayed Initialization, Module Sharing and Prediction-based Topology Setup (DISP). Our DISP scheme greatly benefits the collective initialization of the Cheetah module. At the same time, it helps boost the performance of non-collective initialization in the Tuned module. We evaluate the performance of our implementation on Titan supercomputer at ORNL with up to 4096 processes. The results show that our delayed initialization can speed up the startup of Tuned and Cheetah by an average of 32.0% and 29.2%, respectively, our module sharing can reduce the memory consumption of Tuned and Cheetah by up to 24.1% and 83.5%, respectively, and our prediction-based topology setup can speed up the startup of Cheetah by up to 80%.
Scalable robotic biofabrication of tissue spheroids
International Nuclear Information System (INIS)
Mehesz, A Nagy; Hajdu, Z; Visconti, R P; Markwald, R R; Mironov, V; Brown, J; Beaver, W; Da Silva, J V L
2011-01-01
Development of methods for scalable biofabrication of uniformly sized tissue spheroids is essential for tissue spheroid-based bioprinting of large size tissue and organ constructs. The most recent scalable technique for tissue spheroid fabrication employs a micromolded recessed template prepared in a non-adhesive hydrogel, wherein the cells loaded into the template self-assemble into tissue spheroids due to gravitational force. In this study, we present an improved version of this technique. A new mold was designed to enable generation of 61 microrecessions in each well of a 96-well plate. The microrecessions were seeded with cells using an EpMotion 5070 automated pipetting machine. After 48 h of incubation, tissue spheroids formed at the bottom of each microrecession. To assess the quality of constructs generated using this technology, 600 tissue spheroids made by this method were compared with 600 spheroids generated by the conventional hanging drop method. These analyses showed that tissue spheroids fabricated by the micromolded method are more uniform in diameter. Thus, use of micromolded recessions in a non-adhesive hydrogel, combined with automated cell seeding, is a reliable method for scalable robotic fabrication of uniform-sized tissue spheroids.
Scalable robotic biofabrication of tissue spheroids
Energy Technology Data Exchange (ETDEWEB)
Mehesz, A Nagy; Hajdu, Z; Visconti, R P; Markwald, R R; Mironov, V [Advanced Tissue Biofabrication Center, Department of Regenerative Medicine and Cell Biology, Medical University of South Carolina, Charleston, SC (United States); Brown, J [Department of Mechanical Engineering, Clemson University, Clemson, SC (United States); Beaver, W [York Technical College, Rock Hill, SC (United States); Da Silva, J V L, E-mail: mironovv@musc.edu [Renato Archer Information Technology Center-CTI, Campinas (Brazil)
2011-06-15
Development of methods for scalable biofabrication of uniformly sized tissue spheroids is essential for tissue spheroid-based bioprinting of large size tissue and organ constructs. The most recent scalable technique for tissue spheroid fabrication employs a micromolded recessed template prepared in a non-adhesive hydrogel, wherein the cells loaded into the template self-assemble into tissue spheroids due to gravitational force. In this study, we present an improved version of this technique. A new mold was designed to enable generation of 61 microrecessions in each well of a 96-well plate. The microrecessions were seeded with cells using an EpMotion 5070 automated pipetting machine. After 48 h of incubation, tissue spheroids formed at the bottom of each microrecession. To assess the quality of constructs generated using this technology, 600 tissue spheroids made by this method were compared with 600 spheroids generated by the conventional hanging drop method. These analyses showed that tissue spheroids fabricated by the micromolded method are more uniform in diameter. Thus, use of micromolded recessions in a non-adhesive hydrogel, combined with automated cell seeding, is a reliable method for scalable robotic fabrication of uniform-sized tissue spheroids.
A Scalable Gaussian Process Analysis Algorithm for Biomass Monitoring
Energy Technology Data Exchange (ETDEWEB)
Chandola, Varun [ORNL; Vatsavai, Raju [ORNL
2011-01-01
Biomass monitoring is vital for studying the carbon cycle of earth's ecosystem and has several significant implications, especially in the context of understanding climate change and its impacts. Recently, several change detection methods have been proposed to identify land cover changes in temporal profiles (time series) of vegetation collected using remote sensing instruments, but do not satisfy one or both of the two requirements of the biomass monitoring problem, i.e., {\\em operating in online mode} and {\\em handling periodic time series}. In this paper, we adapt Gaussian process regression to detect changes in such time series in an online fashion. While Gaussian process (GP) have been widely used as a kernel based learning method for regression and classification, their applicability to massive spatio-temporal data sets, such as remote sensing data, has been limited owing to the high computational costs involved. We focus on addressing the scalability issues associated with the proposed GP based change detection algorithm. This paper makes several significant contributions. First, we propose a GP based online time series change detection algorithm and demonstrate its effectiveness in detecting different types of changes in {\\em Normalized Difference Vegetation Index} (NDVI) data obtained from a study area in Iowa, USA. Second, we propose an efficient Toeplitz matrix based solution which significantly improves the computational complexity and memory requirements of the proposed GP based method. Specifically, the proposed solution can analyze a time series of length $t$ in $O(t^2)$ time while maintaining a $O(t)$ memory footprint, compared to the $O(t^3)$ time and $O(t^2)$ memory requirement of standard matrix manipulation based methods. Third, we describe a parallel version of the proposed solution which can be used to simultaneously analyze a large number of time series. We study three different parallel implementations: using threads, MPI, and a
Mapakshi, N. K.; Chang, J.; Nakshatrala, K. B.
2018-04-01
Mathematical models for flow through porous media typically enjoy the so-called maximum principles, which place bounds on the pressure field. It is highly desirable to preserve these bounds on the pressure field in predictive numerical simulations, that is, one needs to satisfy discrete maximum principles (DMP). Unfortunately, many of the existing formulations for flow through porous media models do not satisfy DMP. This paper presents a robust, scalable numerical formulation based on variational inequalities (VI), to model non-linear flows through heterogeneous, anisotropic porous media without violating DMP. VI is an optimization technique that places bounds on the numerical solutions of partial differential equations. To crystallize the ideas, a modification to Darcy equations by taking into account pressure-dependent viscosity will be discretized using the lowest-order Raviart-Thomas (RT0) and Variational Multi-scale (VMS) finite element formulations. It will be shown that these formulations violate DMP, and, in fact, these violations increase with an increase in anisotropy. It will be shown that the proposed VI-based formulation provides a viable route to enforce DMP. Moreover, it will be shown that the proposed formulation is scalable, and can work with any numerical discretization and weak form. A series of numerical benchmark problems are solved to demonstrate the effects of heterogeneity, anisotropy and non-linearity on DMP violations under the two chosen formulations (RT0 and VMS), and that of non-linearity on solver convergence for the proposed VI-based formulation. Parallel scalability on modern computational platforms will be illustrated through strong-scaling studies, which will prove the efficiency of the proposed formulation in a parallel setting. Algorithmic scalability as the problem size is scaled up will be demonstrated through novel static-scaling studies. The performed static-scaling studies can serve as a guide for users to be able to select
Numeric Analysis for Relationship-Aware Scalable Streaming Scheme
Directory of Open Access Journals (Sweden)
Heung Ki Lee
2014-01-01
Full Text Available Frequent packet loss of media data is a critical problem that degrades the quality of streaming services over mobile networks. Packet loss invalidates frames containing lost packets and other related frames at the same time. Indirect loss caused by losing packets decreases the quality of streaming. A scalable streaming service can decrease the amount of dropped multimedia resulting from a single packet loss. Content providers typically divide one large media stream into several layers through a scalable streaming service and then provide each scalable layer to the user depending on the mobile network. Also, a scalable streaming service makes it possible to decode partial multimedia data depending on the relationship between frames and layers. Therefore, a scalable streaming service provides a way to decrease the wasted multimedia data when one packet is lost. However, the hierarchical structure between frames and layers of scalable streams determines the service quality of the scalable streaming service. Even if whole packets of layers are transmitted successfully, they cannot be decoded as a result of the absence of reference frames and layers. Therefore, the complicated relationship between frames and layers in a scalable stream increases the volume of abandoned layers. For providing a high-quality scalable streaming service, we choose a proper relationship between scalable layers as well as the amount of transmitted multimedia data depending on the network situation. We prove that a simple scalable scheme outperforms a complicated scheme in an error-prone network. We suggest an adaptive set-top box (AdaptiveSTB to lower the dependency between scalable layers in a scalable stream. Also, we provide a numerical model to obtain the indirect loss of multimedia data and apply it to various multimedia streams. Our AdaptiveSTB enhances the quality of a scalable streaming service by removing indirect loss.
A task parallel implementation of fast multipole methods
Taura, Kenjiro
2012-11-01
This paper describes a task parallel implementation of ExaFMM, an open source implementation of fast multipole methods (FMM), using a lightweight task parallel library MassiveThreads. Although there have been many attempts on parallelizing FMM, experiences have almost exclusively been limited to formulation based on flat homogeneous parallel loops. FMM in fact contains operations that cannot be readily expressed in such conventional but restrictive models. We show that task parallelism, or parallel recursions in particular, allows us to parallelize all operations of FMM naturally and scalably. Moreover it allows us to parallelize a \\'\\'mutual interaction\\'\\' for force/potential evaluation, which is roughly twice as efficient as a more conventional, unidirectional force/potential evaluation. The net result is an open source FMM that is clearly among the fastest single node implementations, including those on GPUs; with a million particles on a 32 cores Sandy Bridge 2.20GHz node, it completes a single time step including tree construction and force/potential evaluation in 65 milliseconds. The study clearly showcases both programmability and performance benefits of flexible parallel constructs over more monolithic parallel loops. © 2012 IEEE.
An Introduction to Parallelism, Concurrency and Acceleration (1/2)
CERN. Geneva
2016-01-01
Concurrency and parallelism are firm elements of any modern computing infrastructure, made even more prominent by the emergence of accelerators. These lectures offer an introduction to these important concepts. We will begin with a brief refresher of recent hardware offerings to modern-day programmers. We will then open the main discussion with an overview of the laws and practical aspects of scalability. Key parallelism data structures, patterns and algorithms will be shown. The main threats to scalability and mitigation strategies will be discussed in the context of real-life optimization problems. Lecturer's short bio: Andrzej Nowak has 10 years of experience in computing technologies, primarily from CERN openlab and Intel. At CERN, he managed a research lab collaborating with Intel and was part of the openlab Chief Technology Office. Andrzej also worked closely and initiated projects with the private sector (e.g. HP and Google), as well as international research institutes, such as EPFL. Current...
Parallel Auxiliary Space AMG Solver for $H(div)$ Problems
Energy Technology Data Exchange (ETDEWEB)
Kolev, Tzanio V. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States); Vassilevski, Panayot S. [Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
2012-12-18
We present a family of scalable preconditioners for matrices arising in the discretization of $H(div)$ problems using the lowest order Raviart--Thomas finite elements. Our approach belongs to the class of “auxiliary space''--based methods and requires only the finite element stiffness matrix plus some minimal additional discretization information about the topology and orientation of mesh entities. Also, we provide a detailed algebraic description of the theory, parallel implementation, and different variants of this parallel auxiliary space divergence solver (ADS) and discuss its relations to the Hiptmair--Xu (HX) auxiliary space decomposition of $H(div)$ [SIAM J. Numer. Anal., 45 (2007), pp. 2483--2509] and to the auxiliary space Maxwell solver AMS [J. Comput. Math., 27 (2009), pp. 604--623]. Finally, an extensive set of numerical experiments demonstrates the robustness and scalability of our implementation on large-scale $H(div)$ problems with large jumps in the material coefficients.
Scalability Dilemma and Statistic Multiplexed Computing — A Theory and Experiment
Directory of Open Access Journals (Sweden)
Justin Yuan Shi
2017-08-01
Full Text Available The For the last three decades, end-to-end computing paradigms, such as MPI (Message Passing Interface, RPC (Remote Procedure Call and RMI (Remote Method Invocation, have been the de facto paradigms for distributed and parallel programming. Despite of the successes, applications built using these paradigms suffer due to the proportionality factor of crash in the application with its size. Checkpoint/restore and backup/recovery are the only means to save otherwise lost critical information. The scalability dilemma is such a practical challenge that the probability of the data losses increases as the application scales in size. The theoretical significance of this practical challenge is that it undermines the fundamental structure of the scientific discovery process and mission critical services in production today. In 1997, the direct use of end-to-end reference model in distributed programming was recognized as a fallacy. The scalability dilemma was predicted. However, this voice was overrun by the passage of time. Today, the rapidly growing digitized data demands solving the increasingly critical scalability challenges. Computing architecture scalability, although loosely defined, is now the front and center of large-scale computing efforts. Constrained only by the economic law of diminishing returns, this paper proposes a narrow definition of a Scalable Computing Service (SCS. Three scalability tests are also proposed in order to distinguish service architecture flaws from poor application programming. Scalable data intensive service requires additional treatments. Thus, the data storage is assumed reliable in this paper. A single-sided Statistic Multiplexed Computing (SMC paradigm is proposed. A UVR (Unidirectional Virtual Ring SMC architecture is examined under SCS tests. SMC was designed to circumvent the well-known impossibility of end-to-end paradigms. It relies on the proven statistic multiplexing principle to deliver reliable service
Parallelization of pressure equation solver for incompressible N-S equations
International Nuclear Information System (INIS)
Ichihara, Kiyoshi; Yokokawa, Mitsuo; Kaburaki, Hideo.
1996-03-01
A pressure equation solver in a code for 3-dimensional incompressible flow analysis has been parallelized by using red-black SOR method and PCG method on Fujitsu VPP500, a vector parallel computer with distributed memory. For the comparison of scalability, the solver using the red-black SOR method has been also parallelized on the Intel Paragon, a scalar parallel computer with a distributed memory. The scalability of the red-black SOR method on both VPP500 and Paragon was lost, when number of processor elements was increased. The reason of non-scalability on both systems is increasing communication time between processor elements. In addition, the parallelization by DO-loop division makes the vectorizing efficiency lower on VPP500. For an effective implementation on VPP500, a large scale problem which holds very long vectorized DO-loops in the parallel program should be solved. PCG method with red-black SOR method applied to incomplete LU factorization (red-black PCG) has more iteration steps than normal PCG method with forward and backward substitution, in spite of same number of the floating point operations in a DO-loop of incomplete LU factorization. The parallelized red-black PCG method has less merits than the parallelized red-black SOR method when the computational region has fewer grids, because the low vectorization efficiency is obtained in red-black PCG method. (author)
Morse, H Stephen
1994-01-01
Practical Parallel Computing provides information pertinent to the fundamental aspects of high-performance parallel processing. This book discusses the development of parallel applications on a variety of equipment.Organized into three parts encompassing 12 chapters, this book begins with an overview of the technology trends that converge to favor massively parallel hardware over traditional mainframes and vector machines. This text then gives a tutorial introduction to parallel hardware architectures. Other chapters provide worked-out examples of programs using several parallel languages. Thi
Akl, Selim G
1985-01-01
Parallel Sorting Algorithms explains how to use parallel algorithms to sort a sequence of items on a variety of parallel computers. The book reviews the sorting problem, the parallel models of computation, parallel algorithms, and the lower bounds on the parallel sorting problems. The text also presents twenty different algorithms, such as linear arrays, mesh-connected computers, cube-connected computers. Another example where algorithm can be applied is on the shared-memory SIMD (single instruction stream multiple data stream) computers in which the whole sequence to be sorted can fit in the
Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash
2003-01-01
Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.
BCYCLIC: A parallel block tridiagonal matrix cyclic solver
Hirshman, S. P.; Perumalla, K. S.; Lynch, V. E.; Sanchez, R.
2010-09-01
A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as well as its ability to efficiently handle arbitrary (non-powers-of-2) block row and processor numbers. Comparison with a state-of-the art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to optimally use the parallel resources on current supercomputers. Example usage of the solver in magneto-hydrodynamic (MHD), three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited.
The Scalable Coherent Interface and related standards projects
International Nuclear Information System (INIS)
Gustavson, D.B.
1991-09-01
The Scalable Coherent Interface (SCI) project (IEEE P1596) found a way to avoid the limits that are inherent in bus technology. SCI provides bus-like services by transmitting packets on a collection of point-to-point unidirectional links. The SCI protocols support cache coherence in a distributed-shared-memory multiprocessor model, message passing, I/O, and local-area-network-like communication over fiber optic or wire links. VLSI circuits that operate parallel links at 1000 MByte/s and serial links at 1000 Mbit/s will be available early in 1992. Several ongoing SCI-related projects are applying the SCI technology to new areas or extending it to more difficult problems. P1596.1 defines the architecture of a bridge between SCI and VME; P1596.2 compatibly extends the cache coherence mechanism for efficient operation with kiloprocessor systems; P1596.3 defines new low-voltage (about 0.25 V) differential signals suitable for low power interfaces for CMOS or GaAs VLSI implementations of SCI; P1596.4 defines a high performance memory chip interface using these signals; P1596.5 defines data transfer formats for efficient interprocessor communication in heterogeneous multiprocessor systems. This paper reports the current status of SCI, related standards, and new projects. 16 refs
Parallel computing in enterprise modeling.
Energy Technology Data Exchange (ETDEWEB)
Goldsby, Michael E.; Armstrong, Robert C.; Shneider, Max S.; Vanderveen, Keith; Ray, Jaideep; Heath, Zach; Allan, Benjamin A.
2008-08-01
This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principal makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.
Parallel implementations of 2D explicit Euler solvers
International Nuclear Information System (INIS)
Giraud, L.; Manzini, G.
1996-01-01
In this work we present a subdomain partitioning strategy applied to an explicit high-resolution Euler solver. We describe the design of a portable parallel multi-domain code suitable for parallel environments. We present several implementations on a representative range of MlMD computers that include shared memory multiprocessors, distributed virtual shared memory computers, as well as networks of workstations. Computational results are given to illustrate the efficiency, the scalability, and the limitations of the different approaches. We discuss also the effect of the communication protocol on the optimal domain partitioning strategy for the distributed memory computers
Parallel Computing Characteristics of Two-Phase Thermal-Hydraulics code, CUPID
International Nuclear Information System (INIS)
Lee, Jae Ryong; Yoon, Han Young
2013-01-01
Parallelized CUPID code has proved to be able to reproduce multi-dimensional thermal hydraulic analysis by validating with various conceptual problems and experimental data. In this paper, the characteristics of the parallelized CUPID code were investigated. Both single- and two phase simulation are taken into account. Since the scalability of a parallel simulation is known to be better for fine mesh system, two types of mesh system are considered. In addition, the dependency of the preconditioner for matrix solver was also compared. The scalability for the single-phase flow is better than that for two-phase flow due to the less numbers of iterations for solving pressure matrix. The CUPID code was investigated the parallel performance in terms of scalability. The CUPID code was parallelized with domain decomposition method. The MPI library was adopted to communicate the information at the interface cells. As increasing the number of mesh, the scalability is improved. For a given mesh, single-phase flow simulation with diagonal preconditioner shows the best speedup. However, for the two-phase flow simulation, the ILU preconditioner is recommended since it reduces the overall simulation time
Scalable and reusable emulator for evaluating the performance of SS7 networks
Lazar, Aurel A.; Tseng, Kent H.; Lim, Koon Seng; Choe, Winston
1994-04-01
A scalable and reusable emulator was designed and implemented for studying the behavior of SS7 networks. The emulator design was largely based on public domain software. It was developed on top of an environment supported by PVM, the Parallel Virtual Machine, and managed by OSIMIS-the OSI Management Information Service platform. The emulator runs on top of a commercially available ATM LAN interconnecting engineering workstations. As a case study for evaluating the emulator, the behavior of the Singapore National SS7 Network under fault and unbalanced loading conditions was investigated.
Naeem, Raeece
2012-11-28
Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: http://cbrc.kaust.edu.sa/readscan Contact: or raeece.naeem@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. 2012 The Author(s).
An ODMG-compatible testbed architecture for scalable management and analysis of physics data
International Nuclear Information System (INIS)
Malon, D.M.; May, E.N.
1997-01-01
This paper describes a testbed architecture for the investigation and development of scalable approaches to the management and analysis of massive amounts of high energy physics data. The architecture has two components: an interface layer that is compliant with a substantial subset of the ODMG-93 Version 1.2 specification, and a lightweight object persistence manager that provides flexible storage and retrieval services on a variety of single- and multi-level storage architectures, and on a range of parallel and distributed computing platforms
Naeem, Raeece; Rashid, Mamoon; Pain, Arnab
2012-01-01
Summary: READSCAN is a highly scalable parallel program to identify non-host sequences (of potential pathogen origin) and estimate their genome relative abundance in high-throughput sequence datasets. READSCAN accurately classified human and viral sequences on a 20.1 million reads simulated dataset in <27 min using a small Beowulf compute cluster with 16 nodes (Supplementary Material). Availability: http://cbrc.kaust.edu.sa/readscan Contact: or raeece.naeem@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. 2012 The Author(s).
Performance studies of the parallel VIM code
International Nuclear Information System (INIS)
Shi, B.; Blomquist, R.N.
1996-01-01
In this paper, the authors evaluate the performance of the parallel version of the VIM Monte Carlo code on the IBM SPx at the High Performance Computing Research Facility at ANL. Three test problems with contrasting computational characteristics were used to assess effects in performance. A statistical method for estimating the inefficiencies due to load imbalance and communication is also introduced. VIM is a large scale continuous energy Monte Carlo radiation transport program and was parallelized using history partitioning, the master/worker approach, and p4 message passing library. Dynamic load balancing is accomplished when the master processor assigns chunks of histories to workers that have completed a previously assigned task, accommodating variations in the lengths of histories, processor speeds, and worker loads. At the end of each batch (generation), the fission sites and tallies are sent from each worker to the master process, contributing to the parallel inefficiency. All communications are between master and workers, and are serial. The SPx is a scalable 128-node parallel supercomputer with high-performance Omega switches of 63 microsec latency and 35 MBytes/sec bandwidth. For uniform and reproducible performance, they used only the 120 identical regular processors (IBM RS/6000) and excluded the remaining eight planet nodes, which may be loaded by other's jobs
Programming Scala Scalability = Functional Programming + Objects
Wampler, Dean
2009-01-01
Learn how to be more productive with Scala, a new multi-paradigm language for the Java Virtual Machine (JVM) that integrates features of both object-oriented and functional programming. With this book, you'll discover why Scala is ideal for highly scalable, component-based applications that support concurrency and distribution. Programming Scala clearly explains the advantages of Scala as a JVM language. You'll learn how to leverage the wealth of Java class libraries to meet the practical needs of enterprise and Internet projects more easily. Packed with code examples, this book provides us
Tip-Based Nanofabrication for Scalable Manufacturing
Directory of Open Access Journals (Sweden)
Huan Hu
2017-03-01
Full Text Available Tip-based nanofabrication (TBN is a family of emerging nanofabrication techniques that use a nanometer scale tip to fabricate nanostructures. In this review, we first introduce the history of the TBN and the technology development. We then briefly review various TBN techniques that use different physical or chemical mechanisms to fabricate features and discuss some of the state-of-the-art techniques. Subsequently, we focus on those TBN methods that have demonstrated potential to scale up the manufacturing throughput. Finally, we discuss several research directions that are essential for making TBN a scalable nano-manufacturing technology.
Tip-Based Nanofabrication for Scalable Manufacturing
International Nuclear Information System (INIS)
Hu, Huan; Somnath, Suhas
2017-01-01
Tip-based nanofabrication (TBN) is a family of emerging nanofabrication techniques that use a nanometer scale tip to fabricate nanostructures. Here in this review, we first introduce the history of the TBN and the technology development. We then briefly review various TBN techniques that use different physical or chemical mechanisms to fabricate features and discuss some of the state-of-the-art techniques. Subsequently, we focus on those TBN methods that have demonstrated potential to scale up the manufacturing throughput. Finally, we discuss several research directions that are essential for making TBN a scalable nano-manufacturing technology.
Towards a Scalable, Biomimetic, Antibacterial Coating
Dickson, Mary Nora
Corneal afflictions are the second leading cause of blindness worldwide. When a corneal transplant is unavailable or contraindicated, an artificial cornea device is the only chance to save sight. Bacterial or fungal biofilm build up on artificial cornea devices can lead to serious complications including the need for systemic antibiotic treatment and even explantation. As a result, much emphasis has been placed on anti-adhesion chemical coatings and antibiotic leeching coatings. These methods are not long-lasting, and microorganisms can eventually circumvent these measures. Thus, I have developed a surface topographical antimicrobial coating. Various surface structures including rough surfaces, superhydrophobic surfaces, and the natural surfaces of insects' wings and sharks' skin are promising anti-biofilm candidates, however none meet the criteria necessary for implementation on the surface of an artificial cornea device. In this thesis I: 1) developed scalable fabrication protocols for a library of biomimetic nanostructure polymer surfaces 2) assessed the potential these for poly(methyl methacrylate) nanopillars to kill or prevent formation of biofilm by E. coli bacteria and species of Pseudomonas and Staphylococcus bacteria and improved upon a proposed mechanism for the rupture of Gram-negative bacterial cell walls 3) developed a scalable, commercially viable method for producing antibacterial nanopillars on a curved, PMMA artificial cornea device and 4) developed scalable fabrication protocols for implantation of antibacterial nanopatterned surfaces on the surfaces of thermoplastic polyurethane materials, commonly used in catheter tubings. This project constitutes a first step towards fabrication of the first entirely PMMA artificial cornea device. The major finding of this work is that by precisely controlling the topography of a polymer surface at the nano-scale, we can kill adherent bacteria and prevent biofilm formation of certain pathogenic bacteria
Scalable Optical-Fiber Communication Networks
Chow, Edward T.; Peterson, John C.
1993-01-01
Scalable arbitrary fiber extension network (SAFEnet) is conceptual fiber-optic communication network passing digital signals among variety of computers and input/output devices at rates from 200 Mb/s to more than 100 Gb/s. Intended for use with very-high-speed computers and other data-processing and communication systems in which message-passing delays must be kept short. Inherent flexibility makes it possible to match performance of network to computers by optimizing configuration of interconnections. In addition, interconnections made redundant to provide tolerance to faults.
Scalable Tensor Factorizations with Missing Data
DEFF Research Database (Denmark)
Acar, Evrim; Dunlavy, Daniel M.; Kolda, Tamara G.
2010-01-01
of missing data, many important data sets will be discarded or improperly analyzed. Therefore, we need a robust and scalable approach for factorizing multi-way arrays (i.e., tensors) in the presence of missing data. We focus on one of the most well-known tensor factorizations, CANDECOMP/PARAFAC (CP...... is shown to successfully factor tensors with noise and up to 70% missing data. Moreover, our approach is significantly faster than the leading alternative and scales to larger problems. To show the real-world usefulness of CP-WOPT, we illustrate its applicability on a novel EEG (electroencephalogram...
Scalable and Anonymous Group Communication with MTor
Directory of Open Access Journals (Sweden)
Lin Dong
2016-04-01
Full Text Available This paper presents MTor, a low-latency anonymous group communication system. We construct MTor as an extension to Tor, allowing the construction of multi-source multicast trees on top of the existing Tor infrastructure. MTor does not depend on an external service to broker the group communication, and avoids central points of failure and trust. MTor’s substantial bandwidth savings and graceful scalability enable new classes of anonymous applications that are currently too bandwidth-intensive to be viable through traditional unicast Tor communication-e.g., group file transfer, collaborative editing, streaming video, and real-time audio conferencing.
Grassmann Averages for Scalable Robust PCA
DEFF Research Database (Denmark)
Hauberg, Søren; Feragen, Aasa; Black, Michael J.
2014-01-01
As the collection of large datasets becomes increasingly automated, the occurrence of outliers will increase—“big data” implies “big outliers”. While principal component analysis (PCA) is often used to reduce the size of data, and scalable solutions exist, it is well-known that outliers can...... to vectors (subspaces) or elements of vectors; we focus on the latter and use a trimmed average. The resulting Trimmed Grassmann Average (TGA) is particularly appropriate for computer vision because it is robust to pixel outliers. The algorithm has low computational complexity and minimal memory requirements...
Introduction to parallel programming
Brawer, Steven
1989-01-01
Introduction to Parallel Programming focuses on the techniques, processes, methodologies, and approaches involved in parallel programming. The book first offers information on Fortran, hardware and operating system models, and processes, shared memory, and simple parallel programs. Discussions focus on processes and processors, joining processes, shared memory, time-sharing with multiple processors, hardware, loops, passing arguments in function/subroutine calls, program structure, and arithmetic expressions. The text then elaborates on basic parallel programming techniques, barriers and race
Fox, Geoffrey C; Messina, Guiseppe C
2014-01-01
A clear illustration of how parallel computers can be successfully appliedto large-scale scientific computations. This book demonstrates how avariety of applications in physics, biology, mathematics and other scienceswere implemented on real parallel computers to produce new scientificresults. It investigates issues of fine-grained parallelism relevant forfuture supercomputers with particular emphasis on hypercube architecture. The authors describe how they used an experimental approach to configuredifferent massively parallel machines, design and implement basic systemsoftware, and develop
Scalability Optimization of Seamless Positioning Service
Directory of Open Access Journals (Sweden)
Juraj Machaj
2016-01-01
Full Text Available Recently positioning services are getting more attention not only within research community but also from service providers. From the service providers point of view positioning service that will be able to work seamlessly in all environments, for example, indoor, dense urban, and rural, has a huge potential to open new markets. However, such system does not only need to provide accurate position estimates but have to be scalable and resistant to fake positioning requests. In the previous works we have proposed a modular system, which is able to provide seamless positioning in various environments. The system automatically selects optimal positioning module based on available radio signals. The system currently consists of three positioning modules—GPS, GSM based positioning, and Wi-Fi based positioning. In this paper we will propose algorithm which will reduce time needed for position estimation and thus allow higher scalability of the modular system and thus allow providing positioning services to higher amount of users. Such improvement is extremely important, for real world application where large number of users will require position estimates, since positioning error is affected by response time of the positioning server.
Algorithmic psychometrics and the scalable subject.
Stark, Luke
2018-04-01
Recent public controversies, ranging from the 2014 Facebook 'emotional contagion' study to psychographic data profiling by Cambridge Analytica in the 2016 American presidential election, Brexit referendum and elsewhere, signal watershed moments in which the intersecting trajectories of psychology and computer science have become matters of public concern. The entangled history of these two fields grounds the application of applied psychological techniques to digital technologies, and an investment in applying calculability to human subjectivity. Today, a quantifiable psychological subject position has been translated, via 'big data' sets and algorithmic analysis, into a model subject amenable to classification through digital media platforms. I term this position the 'scalable subject', arguing it has been shaped and made legible by algorithmic psychometrics - a broad set of affordances in digital platforms shaped by psychology and the behavioral sciences. In describing the contours of this 'scalable subject', this paper highlights the urgent need for renewed attention from STS scholars on the psy sciences, and on a computational politics attentive to psychology, emotional expression, and sociality via digital media.
Towards Scalable Graph Computation on Mobile Devices.
Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng
2014-10-01
Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach.
Computational scalability of large size image dissemination
Kooper, Rob; Bajcsy, Peter
2011-01-01
We have investigated the computational scalability of image pyramid building needed for dissemination of very large image data. The sources of large images include high resolution microscopes and telescopes, remote sensing and airborne imaging, and high resolution scanners. The term 'large' is understood from a user perspective which means either larger than a display size or larger than a memory/disk to hold the image data. The application drivers for our work are digitization projects such as the Lincoln Papers project (each image scan is about 100-150MB or about 5000x8000 pixels with the total number to be around 200,000) and the UIUC library scanning project for historical maps from 17th and 18th century (smaller number but larger images). The goal of our work is understand computational scalability of the web-based dissemination using image pyramids for these large image scans, as well as the preservation aspects of the data. We report our computational benchmarks for (a) building image pyramids to be disseminated using the Microsoft Seadragon library, (b) a computation execution approach using hyper-threading to generate image pyramids and to utilize the underlying hardware, and (c) an image pyramid preservation approach using various hard drive configurations of Redundant Array of Independent Disks (RAID) drives for input/output operations. The benchmarks are obtained with a map (334.61 MB, JPEG format, 17591x15014 pixels). The discussion combines the speed and preservation objectives.
Towards Scalable Graph Computation on Mobile Devices
Chen, Yiqi; Lin, Zhiyuan; Pienta, Robert; Kahng, Minsuk; Chau, Duen Horng
2015-01-01
Mobile devices have become increasingly central to our everyday activities, due to their portability, multi-touch capabilities, and ever-improving computational power. Such attractive features have spurred research interest in leveraging mobile devices for computation. We explore a novel approach that aims to use a single mobile device to perform scalable graph computation on large graphs that do not fit in the device's limited main memory, opening up the possibility of performing on-device analysis of large datasets, without relying on the cloud. Based on the familiar memory mapping capability provided by today's mobile operating systems, our approach to scale up computation is powerful and intentionally kept simple to maximize its applicability across the iOS and Android platforms. Our experiments demonstrate that an iPad mini can perform fast computation on large real graphs with as many as 272 million edges (Google+ social graph), at a speed that is only a few times slower than a 13″ Macbook Pro. Through creating a real world iOS app with this technique, we demonstrate the strong potential application for scalable graph computation on a single mobile device using our approach. PMID:25859564
Big data integration: scalability and sustainability
Zhang, Zhang
2016-01-26
Integration of various types of omics data is critically indispensable for addressing most important and complex biological questions. In the era of big data, however, data integration becomes increasingly tedious, time-consuming and expensive, posing a significant obstacle to fully exploit the wealth of big biological data. Here we propose a scalable and sustainable architecture that integrates big omics data through community-contributed modules. Community modules are contributed and maintained by different committed groups and each module corresponds to a specific data type, deals with data collection, processing and visualization, and delivers data on-demand via web services. Based on this community-based architecture, we build Information Commons for Rice (IC4R; http://ic4r.org), a rice knowledgebase that integrates a variety of rice omics data from multiple community modules, including genome-wide expression profiles derived entirely from RNA-Seq data, resequencing-based genomic variations obtained from re-sequencing data of thousands of rice varieties, plant homologous genes covering multiple diverse plant species, post-translational modifications, rice-related literatures, and community annotations. Taken together, such architecture achieves integration of different types of data from multiple community-contributed modules and accordingly features scalable, sustainable and collaborative integration of big data as well as low costs for database update and maintenance, thus helpful for building IC4R into a comprehensive knowledgebase covering all aspects of rice data and beneficial for both basic and translational researches.
Domain decomposition method of stochastic PDEs: a two-level scalable preconditioner
International Nuclear Information System (INIS)
Subber, Waad; Sarkar, Abhijit
2012-01-01
For uncertainty quantification in many practical engineering problems, the stochastic finite element method (SFEM) may be computationally challenging. In SFEM, the size of the algebraic linear system grows rapidly with the spatial mesh resolution and the order of the stochastic dimension. In this paper, we describe a non-overlapping domain decomposition method, namely the iterative substructuring method to tackle the large-scale linear system arising in the SFEM. The SFEM is based on domain decomposition in the geometric space and a polynomial chaos expansion in the probabilistic space. In particular, a two-level scalable preconditioner is proposed for the iterative solver of the interface problem for the stochastic systems. The preconditioner is equipped with a coarse problem which globally connects the subdomains both in the geometric and probabilistic spaces via their corner nodes. This coarse problem propagates the information quickly across the subdomains leading to a scalable preconditioner. For numerical illustrations, a two-dimensional stochastic elliptic partial differential equation (SPDE) with spatially varying non-Gaussian random coefficients is considered. The numerical scalability of the the preconditioner is investigated with respect to the mesh size, subdomain size, fixed problem size per subdomain and order of polynomial chaos expansion. The numerical experiments are performed on a Linux cluster using MPI and PETSc parallel libraries.
Scalable Strategies for Computing with Massive Data
Directory of Open Access Journals (Sweden)
Michael Kane
2013-11-01
Full Text Available This paper presents two complementary statistical computing frameworks that address challenges in parallel processing and the analysis of massive data. First, the foreach package allows users of the R programming environment to define parallel loops that may be run sequentially on a single machine, in parallel on a symmetric multiprocessing (SMP machine, or in cluster environments without platform-specific code. Second, the bigmemory package implements memory- and file-mapped data structures that provide (a access to arbitrarily large data while retaining a look and feel that is familiar to R users and (b data structures that are shared across processor cores in order to support efficient parallel computing techniques. Although these packages may be used independently, this paper shows how they can be used in combination to address challenges that have effectively been beyond the reach of researchers who lack specialized software development skills or expensive hardware.
Scalable conditional induction variables (CIV) analysis
Oancea, Cosmin E.; Rauchwerger, Lawrence
2015-01-01
challenges to automatic parallelization. Because the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow
Scalable Learning for Geostatistics and Speaker Recognition
2011-01-01
Device Architecture (CUDA)[63], a parallel programming model that leverages the parallel compute engine in NVIDIA GPUs to solve general purpose...validation. 3.1 Geospatial data reconstruction Sensors deployed on satellites are often used to collect enviromental data where a direct measurement is...same decision as training a model on B and testing on A, which is desirable in many recognition engines . We shall address this in the next chapter. The
Parallel computing of physical maps--a comparative study in SIMD and MIMD parallelism.
Bhandarkar, S M; Chirravuri, S; Arnold, J
1996-01-01
Ordering clones from a genomic library into physical maps of whole chromosomes presents a central computational problem in genetics. Chromosome reconstruction via clone ordering is usually isomorphic to the NP-complete Optimal Linear Arrangement problem. Parallel SIMD and MIMD algorithms for simulated annealing based on Markov chain distribution are proposed and applied to the problem of chromosome reconstruction via clone ordering. Perturbation methods and problem-specific annealing heuristics are proposed and described. The SIMD algorithms are implemented on a 2048 processor MasPar MP-2 system which is an SIMD 2-D toroidal mesh architecture whereas the MIMD algorithms are implemented on an 8 processor Intel iPSC/860 which is an MIMD hypercube architecture. A comparative analysis of the various SIMD and MIMD algorithms is presented in which the convergence, speedup, and scalability characteristics of the various algorithms are analyzed and discussed. On a fine-grained, massively parallel SIMD architecture with a low synchronization overhead such as the MasPar MP-2, a parallel simulated annealing algorithm based on multiple periodically interacting searches performs the best. For a coarse-grained MIMD architecture with high synchronization overhead such as the Intel iPSC/860, a parallel simulated annealing algorithm based on multiple independent searches yields the best results. In either case, distribution of clonal data across multiple processors is shown to exacerbate the tendency of the parallel simulated annealing algorithm to get trapped in a local optimum.
Parallel Atomistic Simulations
Energy Technology Data Exchange (ETDEWEB)
HEFFELFINGER,GRANT S.
2000-01-18
Algorithms developed to enable the use of atomistic molecular simulation methods with parallel computers are reviewed. Methods appropriate for bonded as well as non-bonded (and charged) interactions are included. While strategies for obtaining parallel molecular simulations have been developed for the full variety of atomistic simulation methods, molecular dynamics and Monte Carlo have received the most attention. Three main types of parallel molecular dynamics simulations have been developed, the replicated data decomposition, the spatial decomposition, and the force decomposition. For Monte Carlo simulations, parallel algorithms have been developed which can be divided into two categories, those which require a modified Markov chain and those which do not. Parallel algorithms developed for other simulation methods such as Gibbs ensemble Monte Carlo, grand canonical molecular dynamics, and Monte Carlo methods for protein structure determination are also reviewed and issues such as how to measure parallel efficiency, especially in the case of parallel Monte Carlo algorithms with modified Markov chains are discussed.
Benefits of Parallel I/O in Ab Initio Nuclear Physics Calculations
International Nuclear Information System (INIS)
Laghave, Nikhil; Sosonkina, Masha; Maris, Pieter; Vary, James P.
2009-01-01
Many modern scientific applications rely on highly parallel calculations, which scale to 10's of thousands processors. However, most applications do not concentrate on parallelizing input/output operations. In particular, sequential I/O has been identified as a bottleneck for the highly scalable MFDn (Many Fermion Dynamics for nuclear structure) code performing ab initio nuclear structure calculations. In this paper, we develop interfaces and parallel I/O procedures to use a well-known parallel I/O library in MFDn. As a result, we gain efficient input/output of large datasets along with their portability and ease of use in the downstream processing.
Race: A scalable and elastic parallel system for discovering repeats in very long sequences
Mansour, Essam; El-Roby, Ahmed; Kalnis, Panos; Ahmadia, Aron J.; Aboulnaga, Ashraf
2013-01-01
-efficient implementation. RACE is particularly suitable for the cloud (e.g., Amazon EC2) because, based on availability, it can scale elastically to more or fewer machines during its execution. Since scaling out introduces overheads, mainly due to load imbalance, we
Scalable Algorithms for Parallel Discrete Event Simulation Systems in Multicore Environments
2013-05-01
consolidated at the sender side. At the receiver side, the messages are deconsolidated and delivered to the appropriate thread. This approach bears some...Jiang, S. Kini, W. Yu, D. Buntinas, P. Wyckoff, and D. Panda . Performance comparison of mpi implementations over infiniband, myrinet and quadrics
GPU-FS-kNN: a software tool for fast and scalable kNN computation using GPUs.
Directory of Open Access Journals (Sweden)
Ahmed Shamsul Arefin
Full Text Available BACKGROUND: The analysis of biological networks has become a major challenge due to the recent development of high-throughput techniques that are rapidly producing very large data sets. The exploding volumes of biological data are craving for extreme computational power and special computing facilities (i.e. super-computers. An inexpensive solution, such as General Purpose computation based on Graphics Processing Units (GPGPU, can be adapted to tackle this challenge, but the limitation of the device internal memory can pose a new problem of scalability. An efficient data and computational parallelism with partitioning is required to provide a fast and scalable solution to this problem. RESULTS: We propose an efficient parallel formulation of the k-Nearest Neighbour (kNN search problem, which is a popular method for classifying objects in several fields of research, such as pattern recognition, machine learning and bioinformatics. Being very simple and straightforward, the performance of the kNN search degrades dramatically for large data sets, since the task is computationally intensive. The proposed approach is not only fast but also scalable to large-scale instances. Based on our approach, we implemented a software tool GPU-FS-kNN (GPU-based Fast and Scalable k-Nearest Neighbour for CUDA enabled GPUs. The basic approach is simple and adaptable to other available GPU architectures. We observed speed-ups of 50-60 times compared with CPU implementation on a well-known breast microarray study and its associated data sets. CONCLUSION: Our GPU-based Fast and Scalable k-Nearest Neighbour search technique (GPU-FS-kNN provides a significant performance improvement for nearest neighbour computation in large-scale networks. Source code and the software tool is available under GNU Public License (GPL at https://sourceforge.net/p/gpufsknn/.
Parallel linear solvers for simulations of reactor thermal hydraulics
International Nuclear Information System (INIS)
Yan, Y.; Antal, S.P.; Edge, B.; Keyes, D.E.; Shaver, D.; Bolotnov, I.A.; Podowski, M.Z.
2011-01-01
The state-of-the-art multiphase fluid dynamics code, NPHASE-CMFD, performs multiphase flow simulations in complex domains using implicit nonlinear treatment of the governing equations and in parallel, which is a very challenging environment for the linear solver. The present work illustrates how the Portable, Extensible Toolkit for Scientific Computation (PETSc) and scalable Algebraic Multigrid (AMG) preconditioner from Hypre can be utilized to construct robust and scalable linear solvers for the Newton correction equation obtained from the discretized system of governing conservation equations in NPHASE-CMFD. The overall long-tem objective of this work is to extend the NPHASE-CMFD code into a fully-scalable solver of multiphase flow and heat transfer problems, applicable to both steady-state and stiff time-dependent phenomena in complete fuel assemblies of nuclear reactors and, eventually, the entire reactor core (such as the Virtual Reactor concept envisioned by CASL). This campaign appropriately begins with the linear algebraic equation solver, which is traditionally a bottleneck to scalability in PDE-based codes. The computational complexity of the solver is usually superlinear in problem size, whereas the rest of the code, the “physics” portion, usually has its complexity linear in the problem size. (author)
Scalable Faceted Ranking in Tagging Systems
Orlicki, José I.; Alvarez-Hamelin, J. Ignacio; Fierens, Pablo I.
Nowadays, web collaborative tagging systems which allow users to upload, comment on and recommend contents, are growing. Such systems can be represented as graphs where nodes correspond to users and tagged-links to recommendations. In this paper we analyze the problem of computing a ranking of users with respect to a facet described as a set of tags. A straightforward solution is to compute a PageRank-like algorithm on a facet-related graph, but it is not feasible for online computation. We propose an alternative: (i) a ranking for each tag is computed offline on the basis of tag-related subgraphs; (ii) a faceted order is generated online by merging rankings corresponding to all the tags in the facet. Based on the graph analysis of YouTube and Flickr, we show that step (i) is scalable. We also present efficient algorithms for step (ii), which are evaluated by comparing their results with two gold standards.
A graph algebra for scalable visual analytics.
Shaverdian, Anna A; Zhou, Hao; Michailidis, George; Jagadish, Hosagrahar V
2012-01-01
Visual analytics (VA), which combines analytical techniques with advanced visualization features, is fast becoming a standard tool for extracting information from graph data. Researchers have developed many tools for this purpose, suggesting a need for formal methods to guide these tools' creation. Increased data demands on computing requires redesigning VA tools to consider performance and reliability in the context of analysis of exascale datasets. Furthermore, visual analysts need a way to document their analyses for reuse and results justification. A VA graph framework encapsulated in a graph algebra helps address these needs. Its atomic operators include selection and aggregation. The framework employs a visual operator and supports dynamic attributes of data to enable scalable visual exploration of data.
Scalable group level probabilistic sparse factor analysis
DEFF Research Database (Denmark)
Hinrich, Jesper Løve; Nielsen, Søren Føns Vind; Riis, Nicolai Andre Brogaard
2017-01-01
Many data-driven approaches exist to extract neural representations of functional magnetic resonance imaging (fMRI) data, but most of them lack a proper probabilistic formulation. We propose a scalable group level probabilistic sparse factor analysis (psFA) allowing spatially sparse maps, component...... pruning using automatic relevance determination (ARD) and subject specific heteroscedastic spatial noise modeling. For task-based and resting state fMRI, we show that the sparsity constraint gives rise to components similar to those obtained by group independent component analysis. The noise modeling...... shows that noise is reduced in areas typically associated with activation by the experimental design. The psFA model identifies sparse components and the probabilistic setting provides a natural way to handle parameter uncertainties. The variational Bayesian framework easily extends to more complex...
Scalable on-chip quantum state tomography
Titchener, James G.; Gräfe, Markus; Heilmann, René; Solntsev, Alexander S.; Szameit, Alexander; Sukhorukov, Andrey A.
2018-03-01
Quantum information systems are on a path to vastly exceed the complexity of any classical device. The number of entangled qubits in quantum devices is rapidly increasing, and the information required to fully describe these systems scales exponentially with qubit number. This scaling is the key benefit of quantum systems, however it also presents a severe challenge. To characterize such systems typically requires an exponentially long sequence of different measurements, becoming highly resource demanding for large numbers of qubits. Here we propose and demonstrate a novel and scalable method for characterizing quantum systems based on expanding a multi-photon state to larger dimensionality. We establish that the complexity of this new measurement technique only scales linearly with the number of qubits, while providing a tomographically complete set of data without a need for reconfigurability. We experimentally demonstrate an integrated photonic chip capable of measuring two- and three-photon quantum states with statistical reconstruction fidelity of 99.71%.
A versatile scalable PET processing system
International Nuclear Information System (INIS)
Dong, H.; Weisenberger, A.; McKisson, J.; Wenze, Xi; Cuevas, C.; Wilson, J.; Zukerman, L.
2011-01-01
Positron Emission Tomography (PET) historically has major clinical and preclinical applications in cancerous oncology, neurology, and cardiovascular diseases. Recently, in a new direction, an application specific PET system is being developed at Thomas Jefferson National Accelerator Facility (Jefferson Lab) in collaboration with Duke University, University of Maryland at Baltimore (UMAB), and West Virginia University (WVU) targeted for plant eco-physiology research. The new plant imaging PET system is versatile and scalable such that it could adapt to several plant imaging needs - imaging many important plant organs including leaves, roots, and stems. The mechanical arrangement of the detectors is designed to accommodate the unpredictable and random distribution in space of the plant organs without requiring the plant be disturbed. Prototyping such a system requires a new data acquisition system (DAQ) and data processing system which are adaptable to the requirements of these unique and versatile detectors.
The Concept of Business Model Scalability
DEFF Research Database (Denmark)
Nielsen, Christian; Lund, Morten
2015-01-01
The power of business models lies in their ability to visualize and clarify how firms’ may configure their value creation processes. Among the key aspects of business model thinking are a focus on what the customer values, how this value is best delivered to the customer and how strategic partners...... are leveraged in this value creation, delivery and realization exercise. Central to the mainstream understanding of business models is the value proposition towards the customer and the hypothesis generated is that if the firm delivers to the customer what he/she requires, then there is a good foundation...... for a long-term profitable business. However, the message conveyed in this article is that while providing a good value proposition may help the firm ‘get by’, the really successful businesses of today are those able to reach the sweet-spot of business model scalability. This article introduces and discusses...
The scalable coherent interface, IEEE P1596
International Nuclear Information System (INIS)
Gustavson, D.B.
1990-01-01
IEEE P1596, the scalable coherent interface (formerly known as SuperBus) is based on experience gained while developing Fastbus (ANSI/IEEE 960--1986, IEC 935), Futurebus (IEEE P896.x) and other modern 32-bit buses. SCI goals include a minimum bandwidth of 1 GByte/sec per processor in multiprocessor systems with thousands of processors; efficient support of a coherent distributed-cache image of distributed shared memory; support for repeaters which interface to existing or future buses; and support for inexpensive small rings as well as for general switched interconnections like Banyan, Omega, or crossbar networks. This paper presents a summary of current directions, reports the status of the work in progress, and suggests some applications in data acquisition and physics
BASSET: Scalable Gateway Finder in Large Graphs
Energy Technology Data Exchange (ETDEWEB)
Tong, H; Papadimitriou, S; Faloutsos, C; Yu, P S; Eliassi-Rad, T
2010-11-03
Given a social network, who is the best person to introduce you to, say, Chris Ferguson, the poker champion? Or, given a network of people and skills, who is the best person to help you learn about, say, wavelets? The goal is to find a small group of 'gateways': persons who are close enough to us, as well as close enough to the target (person, or skill) or, in other words, are crucial in connecting us to the target. The main contributions are the following: (a) we show how to formulate this problem precisely; (b) we show that it is sub-modular and thus it can be solved near-optimally; (c) we give fast, scalable algorithms to find such gateways. Experiments on real data sets validate the effectiveness and efficiency of the proposed methods, achieving up to 6,000,000x speedup.
Scalable quantum search using trapped ions
International Nuclear Information System (INIS)
Ivanov, S. S.; Ivanov, P. A.; Linington, I. E.; Vitanov, N. V.
2010-01-01
We propose a scalable implementation of Grover's quantum search algorithm in a trapped-ion quantum information processor. The system is initialized in an entangled Dicke state by using adiabatic techniques. The inversion-about-average and oracle operators take the form of single off-resonant laser pulses. This is made possible by utilizing the physical symmetries of the trapped-ion linear crystal. The physical realization of the algorithm represents a dramatic simplification: each logical iteration (oracle and inversion about average) requires only two physical interaction steps, in contrast to the large number of concatenated gates required by previous approaches. This not only facilitates the implementation but also increases the overall fidelity of the algorithm.
Scalable graphene aptasensors for drug quantification
Vishnubhotla, Ramya; Ping, Jinglei; Gao, Zhaoli; Lee, Abigail; Saouaf, Olivia; Vrudhula, Amey; Johnson, A. T. Charlie
2017-11-01
Simpler and more rapid approaches for therapeutic drug-level monitoring are highly desirable to enable use at the point-of-care. We have developed an all-electronic approach for detection of the HIV drug tenofovir based on scalable fabrication of arrays of graphene field-effect transistors (GFETs) functionalized with a commercially available DNA aptamer. The shift in the Dirac voltage of the GFETs varied systematically with the concentration of tenofovir in deionized water, with a detection limit less than 1 ng/mL. Tests against a set of negative controls confirmed the specificity of the sensor response. This approach offers the potential for further development into a rapid and convenient point-of-care tool with clinically relevant performance.
Konduri, Aditya
Many natural and engineering systems are governed by nonlinear partial differential equations (PDEs) which result in a multiscale phenomena, e.g. turbulent flows. Numerical simulations of these problems are computationally very expensive and demand for extreme levels of parallelism. At realistic conditions, simulations are being carried out on massively parallel computers with hundreds of thousands of processing elements (PEs). It has been observed that communication between PEs as well as their synchronization at these extreme scales take up a significant portion of the total simulation time and result in poor scalability of codes. This issue is likely to pose a bottleneck in scalability of codes on future Exascale systems. In this work, we propose an asynchronous computing algorithm based on widely used finite difference methods to solve PDEs in which synchronization between PEs due to communication is relaxed at a mathematical level. We show that while stability is conserved when schemes are used asynchronously, accuracy is greatly degraded. Since message arrivals at PEs are random processes, so is the behavior of the error. We propose a new statistical framework in which we show that average errors drop always to first-order regardless of the original scheme. We propose new asynchrony-tolerant schemes that maintain accuracy when synchronization is relaxed. The quality of the solution is shown to depend, not only on the physical phenomena and numerical schemes, but also on the characteristics of the computing machine. A novel algorithm using remote memory access communications has been developed to demonstrate excellent scalability of the method for large-scale computing. Finally, we present a path to extend this method in solving complex multi-scale problems on Exascale machines.
Lundquist, Carol; Frieder, Ophir; Holmes, David O.; Grossman, David
1999-01-01
Describes a scalable, parallel, relational database-drive information retrieval engine. To support portability across a wide range of execution environments, all algorithms adhere to the SQL-92 standard. By incorporating relevance feedback algorithms, accuracy is enhanced over prior database-driven information retrieval efforts. Presents…
Scalable Transactions for Web Applications in the Cloud
Zhou, W.; Pierre, G.E.O.; Chi, C.-H.
2009-01-01
Cloud Computing platforms provide scalability and high availability properties for web applications but they sacrifice data consistency at the same time. However, many applications cannot afford any data inconsistency. We present a scalable transaction manager for NoSQL cloud database services to
New Complexity Scalable MPEG Encoding Techniques for Mobile Applications
Directory of Open Access Journals (Sweden)
Stephan Mietens
2004-03-01
Full Text Available Complexity scalability offers the advantage of one-time design of video applications for a large product family, including mobile devices, without the need of redesigning the applications on the algorithmic level to meet the requirements of the different products. In this paper, we present complexity scalable MPEG encoding having core modules with modifications for scalability. The interdependencies of the scalable modules and the system performance are evaluated. Experimental results show scalability giving a smooth change in complexity and corresponding video quality. Scalability is basically achieved by varying the number of computed DCT coefficients and the number of evaluated motion vectors but other modules are designed such they scale with the previous parameters. In the experiments using the Ã‚Â“StefanÃ‚Â” sequence, the elapsed execution time of the scalable encoder, reflecting the computational complexity, can be gradually reduced to roughly 50% of its original execution time. The video quality scales between 20 dB and 48 dB PSNR with unity quantizer setting, and between 21.5 dB and 38.5 dB PSNR for different sequences targeting 1500 kbps. The implemented encoder and the scalability techniques can be successfully applied in mobile systems based on MPEG video compression.
Building scalable apps with Redis and Node.js
Johanan, Joshua
2014-01-01
If the phrase scalability sounds alien to you, then this is an ideal book for you. You will not need much Node.js experience as each framework is demonstrated in a way that requires no previous knowledge of the framework. You will be building scalable Node.js applications in no time! Knowledge of JavaScript is required.
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R.; Bock, Davi D.; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C.; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R. Clay; Smith, Stephen J.; Szalay, Alexander S.; Vogelstein, Joshua T.; Vogelstein, R. Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes— neural connectivity maps of the brain—using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems—reads to parallel disk arrays and writes to solid-state storage—to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization. PMID:24401992
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes - neural connectivity maps of the brain-using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems-reads to parallel disk arrays and writes to solid-state storage-to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effec-tiveness of spatial data organization.
Fourier transform based scalable image quality measure.
Narwaria, Manish; Lin, Weisi; McLoughlin, Ian; Emmanuel, Sabu; Chia, Liang-Tien
2012-08-01
We present a new image quality assessment (IQA) algorithm based on the phase and magnitude of the 2D (twodimensional) Discrete Fourier Transform (DFT). The basic idea is to compare the phase and magnitude of the reference and distorted images to compute the quality score. However, it is well known that the Human Visual Systems (HVSs) sensitivity to different frequency components is not the same. We accommodate this fact via a simple yet effective strategy of nonuniform binning of the frequency components. This process also leads to reduced space representation of the image thereby enabling the reduced-reference (RR) prospects of the proposed scheme. We employ linear regression to integrate the effects of the changes in phase and magnitude. In this way, the required weights are determined via proper training and hence more convincing and effective. Lastly, using the fact that phase usually conveys more information than magnitude, we use only the phase for RR quality assessment. This provides the crucial advantage of further reduction in the required amount of reference image information. The proposed method is therefore further scalable for RR scenarios. We report extensive experimental results using a total of 9 publicly available databases: 7 image (with a total of 3832 distorted images with diverse distortions) and 2 video databases (totally 228 distorted videos). These show that the proposed method is overall better than several of the existing fullreference (FR) algorithms and two RR algorithms. Additionally, there is a graceful degradation in prediction performance as the amount of reference image information is reduced thereby confirming its scalability prospects. To enable comparisons and future study, a Matlab implementation of the proposed algorithm is available at http://www.ntu.edu.sg/home/wslin/reduced_phase.rar.
Improving diabetes medication adherence: successful, scalable interventions
Directory of Open Access Journals (Sweden)
Zullig LL
2015-01-01
Full Text Available Leah L Zullig,1,2 Walid F Gellad,3,4 Jivan Moaddeb,2,5 Matthew J Crowley,1,2 William Shrank,6 Bradi B Granger,7 Christopher B Granger,8 Troy Trygstad,9 Larry Z Liu,10 Hayden B Bosworth1,2,7,11 1Center for Health Services Research in Primary Care, Durham Veterans Affairs Medical Center, Durham, NC, USA; 2Department of Medicine, Duke University, Durham, NC, USA; 3Center for Health Equity Research and Promotion, Pittsburgh Veterans Affairs Medical Center, Pittsburgh, PA, USA; 4Division of General Internal Medicine, University of Pittsburgh, Pittsburgh, PA, USA; 5Institute for Genome Sciences and Policy, Duke University, Durham, NC, USA; 6CVS Caremark Corporation; 7School of Nursing, Duke University, Durham, NC, USA; 8Department of Medicine, Division of Cardiology, Duke University School of Medicine, Durham, NC, USA; 9North Carolina Community Care Networks, Raleigh, NC, USA; 10Pfizer, Inc., and Weill Medical College of Cornell University, New York, NY, USA; 11Department of Psychiatry and Behavioral Sciences, Duke University School of Medicine, Durham, NC, USA Abstract: Effective medications are a cornerstone of prevention and disease treatment, yet only about half of patients take their medications as prescribed, resulting in a common and costly public health challenge for the US healthcare system. Since poor medication adherence is a complex problem with many contributing causes, there is no one universal solution. This paper describes interventions that were not only effective in improving medication adherence among patients with diabetes, but were also potentially scalable (ie, easy to implement to a large population. We identify key characteristics that make these interventions effective and scalable. This information is intended to inform healthcare systems seeking proven, low resource, cost-effective solutions to improve medication adherence. Keywords: medication adherence, diabetes mellitus, chronic disease, dissemination research
CERN. Geneva
2016-01-01
The traditionally used and well established parallel programming models OpenMP and MPI are both targeting lower level parallelism and are meant to be as language agnostic as possible. For a long time, those models were the only widely available portable options for developing parallel C++ applications beyond using plain threads. This has strongly limited the optimization capabilities of compilers, has inhibited extensibility and genericity, and has restricted the use of those models together with other, modern higher level abstractions introduced by the C++11 and C++14 standards. The recent revival of interest in the industry and wider community for the C++ language has also spurred a remarkable amount of standardization proposals and technical specifications being developed. Those efforts however have so far failed to build a vision on how to seamlessly integrate various types of parallelism, such as iterative parallel execution, task-based parallelism, asynchronous many-task execution flows, continuation s...
Parallelism in matrix computations
Gallopoulos, Efstratios; Sameh, Ahmed H
2016-01-01
This book is primarily intended as a research monograph that could also be used in graduate courses for the design of parallel algorithms in matrix computations. It assumes general but not extensive knowledge of numerical linear algebra, parallel architectures, and parallel programming paradigms. The book consists of four parts: (I) Basics; (II) Dense and Special Matrix Computations; (III) Sparse Matrix Computations; and (IV) Matrix functions and characteristics. Part I deals with parallel programming paradigms and fundamental kernels, including reordering schemes for sparse matrices. Part II is devoted to dense matrix computations such as parallel algorithms for solving linear systems, linear least squares, the symmetric algebraic eigenvalue problem, and the singular-value decomposition. It also deals with the development of parallel algorithms for special linear systems such as banded ,Vandermonde ,Toeplitz ,and block Toeplitz systems. Part III addresses sparse matrix computations: (a) the development of pa...
Scalable and Media Aware Adaptive Video Streaming over Wireless Networks
Directory of Open Access Journals (Sweden)
Béatrice Pesquet-Popescu
2008-07-01
Full Text Available This paper proposes an advanced video streaming system based on scalable video coding in order to optimize resource utilization in wireless networks with retransmission mechanisms at radio protocol level. The key component of this system is a packet scheduling algorithm which operates on the different substreams of a main scalable video stream and which is implemented in a so-called media aware network element. The concerned type of transport channel is a dedicated channel subject to parameters (bitrate, loss rate variations on the long run. Moreover, we propose a combined scalability approach in which common temporal and SNR scalability features can be used jointly with a partitioning of the image into regions of interest. Simulation results show that our approach provides substantial quality gain compared to classical packet transmission methods and they demonstrate how ROI coding combined with SNR scalability allows to improve again the visual quality.
DEFF Research Database (Denmark)
Sitchinava, Nodar; Zeh, Norbert
2012-01-01
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of...... in the optimal OhOf(psortN + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and psortN is the parallel I/O complexity of sorting N elements using P processors....
Deshmane, Anagha; Gulani, Vikas; Griswold, Mark A; Seiberlich, Nicole
2012-07-01
Parallel imaging is a robust method for accelerating the acquisition of magnetic resonance imaging (MRI) data, and has made possible many new applications of MR imaging. Parallel imaging works by acquiring a reduced amount of k-space data with an array of receiver coils. These undersampled data can be acquired more quickly, but the undersampling leads to aliased images. One of several parallel imaging algorithms can then be used to reconstruct artifact-free images from either the aliased images (SENSE-type reconstruction) or from the undersampled data (GRAPPA-type reconstruction). The advantages of parallel imaging in a clinical setting include faster image acquisition, which can be used, for instance, to shorten breath-hold times resulting in fewer motion-corrupted examinations. In this article the basic concepts behind parallel imaging are introduced. The relationship between undersampling and aliasing is discussed and two commonly used parallel imaging methods, SENSE and GRAPPA, are explained in detail. Examples of artifacts arising from parallel imaging are shown and ways to detect and mitigate these artifacts are described. Finally, several current applications of parallel imaging are presented and recent advancements and promising research in parallel imaging are briefly reviewed. Copyright © 2012 Wiley Periodicals, Inc.
Parallel Algorithms and Patterns
Energy Technology Data Exchange (ETDEWEB)
Robey, Robert W. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)
2016-06-16
This is a powerpoint presentation on parallel algorithms and patterns. A parallel algorithm is a well-defined, step-by-step computational procedure that emphasizes concurrency to solve a problem. Examples of problems include: Sorting, searching, optimization, matrix operations. A parallel pattern is a computational step in a sequence of independent, potentially concurrent operations that occurs in diverse scenarios with some frequency. Examples are: Reductions, prefix scans, ghost cell updates. We only touch on parallel patterns in this presentation. It really deserves its own detailed discussion which Gabe Rockefeller would like to develop.
Application Portable Parallel Library
Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott
1995-01-01
Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.
Scalable Molecular Dynamics for Large Biomolecular Systems
Directory of Open Access Journals (Sweden)
Robert K. Brunner
2000-01-01
Full Text Available We present an optimized parallelization scheme for molecular dynamics simulations of large biomolecular systems, implemented in the production-quality molecular dynamics program NAMD. With an object-based hybrid force and spatial decomposition scheme, and an aggressive measurement-based predictive load balancing framework, we have attained speeds and speedups that are much higher than any reported in literature so far. The paper first summarizes the broad methodology we are pursuing, and the basic parallelization scheme we used. It then describes the optimizations that were instrumental in increasing performance, and presents performance results on benchmark simulations.
Large-scale parallel genome assembler over cloud computing environment.
Das, Arghya Kusum; Koppa, Praveen Kumar; Goswami, Sayan; Platania, Richard; Park, Seung-Jong
2017-06-01
The size of high throughput DNA sequencing data has already reached the terabyte scale. To manage this huge volume of data, many downstream sequencing applications started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay as you go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, developing scalable data-intensive bioinformatics applications using this model and understanding the hardware environment that these applications require for good performance, both require further research. In this paper, we present a de Bruijn graph oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses the power of Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating the computation and data. GiGA achieves significantly higher scalability with competitive assembly quality compared to contemporary parallel assemblers (e.g. ABySS and Contrail) over traditional HPC cluster. Moreover, we show that the performance of GiGA is significantly improved by using an SSD-based private cloud infrastructure over traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches that of 512 cores of traditional HPC cluster.
Computational chaos in massively parallel neural networks
Barhen, Jacob; Gulati, Sandeep
1989-01-01
A fundamental issue which directly impacts the scalability of current theoretical neural network models to massively parallel embodiments, in both software as well as hardware, is the inherent and unavoidable concurrent asynchronicity of emerging fine-grained computational ensembles and the possible emergence of chaotic manifestations. Previous analyses attributed dynamical instability to the topology of the interconnection matrix, to parasitic components or to propagation delays. However, researchers have observed the existence of emergent computational chaos in a concurrently asynchronous framework, independent of the network topology. Researcher present a methodology enabling the effective asynchronous operation of large-scale neural networks. Necessary and sufficient conditions guaranteeing concurrent asynchronous convergence are established in terms of contracting operators. Lyapunov exponents are computed formally to characterize the underlying nonlinear dynamics. Simulation results are presented to illustrate network convergence to the correct results, even in the presence of large delays.
Scalable Adaptive Multilevel Solvers for Multiphysics Problems
Energy Technology Data Exchange (ETDEWEB)
Xu, Jinchao [Pennsylvania State Univ., University Park, PA (United States). Dept. of Mathematics
2014-11-26
In this project, we carried out many studies on adaptive and parallel multilevel methods for numerical modeling for various applications, including Magnetohydrodynamics (MHD) and complex fluids. We have made significant efforts and advances in adaptive multilevel methods of the multiphysics problems: multigrid methods, adaptive finite element methods, and applications.
Parallel discrete event simulation
Overeinder, B.J.; Hertzberger, L.O.; Sloot, P.M.A.; Withagen, W.J.
1991-01-01
In simulating applications for execution on specific computing systems, the simulation performance figures must be known in a short period of time. One basic approach to the problem of reducing the required simulation time is the exploitation of parallelism. However, in parallelizing the simulation
Parallel reservoir simulator computations
International Nuclear Information System (INIS)
Hemanth-Kumar, K.; Young, L.C.
1995-01-01
The adaptation of a reservoir simulator for parallel computations is described. The simulator was originally designed for vector processors. It performs approximately 99% of its calculations in vector/parallel mode and relative to scalar calculations it achieves speedups of 65 and 81 for black oil and EOS simulations, respectively on the CRAY C-90
Scalable and portable visualization of large atomistic datasets
Sharma, Ashish; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya
2004-10-01
A scalable and portable code named Atomsviewer has been developed to interactively visualize a large atomistic dataset consisting of up to a billion atoms. The code uses a hierarchical view frustum-culling algorithm based on the octree data structure to efficiently remove atoms outside of the user's field-of-view. Probabilistic and depth-based occlusion-culling algorithms then select atoms, which have a high probability of being visible. Finally a multiresolution algorithm is used to render the selected subset of visible atoms at varying levels of detail. Atomsviewer is written in C++ and OpenGL, and it has been tested on a number of architectures including Windows, Macintosh, and SGI. Atomsviewer has been used to visualize tens of millions of atoms on a standard desktop computer and, in its parallel version, up to a billion atoms. Program summaryTitle of program: Atomsviewer Catalogue identifier: ADUM Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADUM Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland Computer for which the program is designed and others on which it has been tested: 2.4 GHz Pentium 4/Xeon processor, professional graphics card; Apple G4 (867 MHz)/G5, professional graphics card Operating systems under which the program has been tested: Windows 2000/XP, Mac OS 10.2/10.3, SGI IRIX 6.5 Programming languages used: C++, C and OpenGL Memory required to execute with typical data: 1 gigabyte of RAM High speed storage required: 60 gigabytes No. of lines in the distributed program including test data, etc.: 550 241 No. of bytes in the distributed program including test data, etc.: 6 258 245 Number of bits in a word: Arbitrary Number of processors used: 1 Has the code been vectorized or parallelized: No Distribution format: tar gzip file Nature of physical problem: Scientific visualization of atomic systems Method of solution: Rendering of atoms using computer graphic techniques, culling algorithms for data
Totally parallel multilevel algorithms
Frederickson, Paul O.
1988-01-01
Four totally parallel algorithms for the solution of a sparse linear system have common characteristics which become quite apparent when they are implemented on a highly parallel hypercube such as the CM2. These four algorithms are Parallel Superconvergent Multigrid (PSMG) of Frederickson and McBryan, Robust Multigrid (RMG) of Hackbusch, the FFT based Spectral Algorithm, and Parallel Cyclic Reduction. In fact, all four can be formulated as particular cases of the same totally parallel multilevel algorithm, which are referred to as TPMA. In certain cases the spectral radius of TPMA is zero, and it is recognized to be a direct algorithm. In many other cases the spectral radius, although not zero, is small enough that a single iteration per timestep keeps the local error within the required tolerance.
Energy Technology Data Exchange (ETDEWEB)
1991-10-23
An account of the Caltech Concurrent Computation Program (C{sup 3}P), a five year project that focused on answering the question: Can parallel computers be used to do large-scale scientific computations '' As the title indicates, the question is answered in the affirmative, by implementing numerous scientific applications on real parallel computers and doing computations that produced new scientific results. In the process of doing so, C{sup 3}P helped design and build several new computers, designed and implemented basic system software, developed algorithms for frequently used mathematical computations on massively parallel machines, devised performance models and measured the performance of many computers, and created a high performance computing facility based exclusively on parallel computers. While the initial focus of C{sup 3}P was the hypercube architecture developed by C. Seitz, many of the methods developed and lessons learned have been applied successfully on other massively parallel architectures.
Massively parallel mathematical sieves
Energy Technology Data Exchange (ETDEWEB)
Montry, G.R.
1989-01-01
The Sieve of Eratosthenes is a well-known algorithm for finding all prime numbers in a given subset of integers. A parallel version of the Sieve is described that produces computational speedups over 800 on a hypercube with 1,024 processing elements for problems of fixed size. Computational speedups as high as 980 are achieved when the problem size per processor is fixed. The method of parallelization generalizes to other sieves and will be efficient on any ensemble architecture. We investigate two highly parallel sieves using scattered decomposition and compare their performance on a hypercube multiprocessor. A comparison of different parallelization techniques for the sieve illustrates the trade-offs necessary in the design and implementation of massively parallel algorithms for large ensemble computers.
Distributed and parallel approach for handle and perform huge datasets
Konopko, Joanna
2015-12-01
Big Data refers to the dynamic, large and disparate volumes of data comes from many different sources (tools, machines, sensors, mobile devices) uncorrelated with each others. It requires new, innovative and scalable technology to collect, host and analytically process the vast amount of data. Proper architecture of the system that perform huge data sets is needed. In this paper, the comparison of distributed and parallel system architecture is presented on the example of MapReduce (MR) Hadoop platform and parallel database platform (DBMS). This paper also analyzes the problem of performing and handling valuable information from petabytes of data. The both paradigms: MapReduce and parallel DBMS are described and compared. The hybrid architecture approach is also proposed and could be used to solve the analyzed problem of storing and processing Big Data.
Parallel alternating direction preconditioner for isogeometric simulations of explicit dynamics
Łoś, Marcin
2015-04-27
In this paper we present a parallel implementation of the alternating direction preconditioner for isogeometric simulations of explicit dynamics. The Alternating Direction Implicit (ADI) algorithm, belongs to the category of matrix-splitting iterative methods, was proposed almost six decades ago for solving parabolic and elliptic partial differential equations, see [1–4]. The new version of this algorithm has been recently developed for isogeometric simulations of two dimensional explicit dynamics [5] and steady-state diffusion equations with orthotropic heterogenous coefficients [6]. In this paper we present a parallel version of the alternating direction implicit algorithm for three dimensional simulations. The algorithm has been incorporated as a part of PETIGA an isogeometric framework [7] build on top of PETSc [8]. We show the scalability of the parallel algorithm on STAMPEDE linux cluster up to 10,000 processors, as well as the convergence rate of the PCG solver with ADI algorithm as preconditioner.
Engineering-Based Thermal CFD Simulations on Massive Parallel Systems
Frisch, Jérôme
2015-05-22
The development of parallel Computational Fluid Dynamics (CFD) codes is a challenging task that entails efficient parallelization concepts and strategies in order to achieve good scalability values when running those codes on modern supercomputers with several thousands to millions of cores. In this paper, we present a hierarchical data structure for massive parallel computations that supports the coupling of a Navier–Stokes-based fluid flow code with the Boussinesq approximation in order to address complex thermal scenarios for energy-related assessments. The newly designed data structure is specifically designed with the idea of interactive data exploration and visualization during runtime of the simulation code; a major shortcoming of traditional high-performance computing (HPC) simulation codes. We further show and discuss speed-up values obtained on one of Germany’s top-ranked supercomputers with up to 140,000 processes and present simulation results for different engineering-based thermal problems.
Parallel iterative solvers and preconditioners using approximate hierarchical methods
Energy Technology Data Exchange (ETDEWEB)
Grama, A.; Kumar, V.; Sameh, A. [Univ. of Minnesota, Minneapolis, MN (United States)
1996-12-31
In this paper, we report results of the performance, convergence, and accuracy of a parallel GMRES solver for Boundary Element Methods. The solver uses a hierarchical approximate matrix-vector product based on a hybrid Barnes-Hut / Fast Multipole Method. We study the impact of various accuracy parameters on the convergence and show that with minimal loss in accuracy, our solver yields significant speedups. We demonstrate the excellent parallel efficiency and scalability of our solver. The combined speedups from approximation and parallelism represent an improvement of several orders in solution time. We also develop fast and paralellizable preconditioners for this problem. We report on the performance of an inner-outer scheme and a preconditioner based on truncated Green`s function. Experimental results on a 256 processor Cray T3D are presented.
Oracle database performance and scalability a quantitative approach
Liu, Henry H
2011-01-01
A data-driven, fact-based, quantitative text on Oracle performance and scalability With database concepts and theories clearly explained in Oracle's context, readers quickly learn how to fully leverage Oracle's performance and scalability capabilities at every stage of designing and developing an Oracle-based enterprise application. The book is based on the author's more than ten years of experience working with Oracle, and is filled with dependable, tested, and proven performance optimization techniques. Oracle Database Performance and Scalability is divided into four parts that enable reader
Scalable-to-lossless transform domain distributed video coding
DEFF Research Database (Denmark)
Huang, Xin; Ukhanova, Ann; Veselov, Anton
2010-01-01
Distributed video coding (DVC) is a novel approach providing new features as low complexity encoding by mainly exploiting the source statistics at the decoder based on the availability of decoder side information. In this paper, scalable-tolossless DVC is presented based on extending a lossy Tran...... codec provides frame by frame encoding. Comparing the lossless coding efficiency, the proposed scalable-to-lossless TDWZ video codec can save up to 5%-13% bits compared to JPEG LS and H.264 Intra frame lossless coding and do so as a scalable-to-lossless coding....
Design issues for numerical libraries on scalable multicore architectures
International Nuclear Information System (INIS)
Heroux, M A
2008-01-01
Future generations of scalable computers will rely on multicore nodes for a significant portion of overall system performance. At present, most applications and libraries cannot exploit multiple cores beyond running addition MPI processes per node. In this paper we discuss important multicore architecture issues, programming models, algorithms requirements and software design related to effective use of scalable multicore computers. In particular, we focus on important issues for library research and development, making recommendations for how to effectively develop libraries for future scalable computer systems
Esmaily, M.; Jofre, L.; Mani, A.; Iaccarino, G.
2018-03-01
A geometric multigrid algorithm is introduced for solving nonsymmetric linear systems resulting from the discretization of the variable density Navier-Stokes equations on nonuniform structured rectilinear grids and high-Reynolds number flows. The restriction operation is defined such that the resulting system on the coarser grids is symmetric, thereby allowing for the use of efficient smoother algorithms. To achieve an optimal rate of convergence, the sequence of interpolation and restriction operations are determined through a dynamic procedure. A parallel partitioning strategy is introduced to minimize communication while maintaining the load balance between all processors. To test the proposed algorithm, we consider two cases: 1) homogeneous isotropic turbulence discretized on uniform grids and 2) turbulent duct flow discretized on stretched grids. Testing the algorithm on systems with up to a billion unknowns shows that the cost varies linearly with the number of unknowns. This O (N) behavior confirms the robustness of the proposed multigrid method regarding ill-conditioning of large systems characteristic of multiscale high-Reynolds number turbulent flows. The robustness of our method to density variations is established by considering cases where density varies sharply in space by a factor of up to 104, showing its applicability to two-phase flow problems. Strong and weak scalability studies are carried out, employing up to 30,000 processors, to examine the parallel performance of our implementation. Excellent scalability of our solver is shown for a granularity as low as 104 to 105 unknowns per processor. At its tested peak throughput, it solves approximately 4 billion unknowns per second employing over 16,000 processors with a parallel efficiency higher than 50%.
International Nuclear Information System (INIS)
Mo Zeyao
2004-11-01
Multiphysics parallel numerical simulations are usually essential to simplify researches on complex physical phenomena in which several physics are tightly coupled. It is very important on how to concatenate those coupled physics for fully scalable parallel simulation. Meanwhile, three objectives should be balanced, the first is efficient data transfer among simulations, the second and the third are efficient parallel executions and simultaneously developments of those simulation codes. Two concatenating algorithms for multiphysics parallel numerical simulations coupling radiation hydrodynamics with neutron transport on unstructured grid are presented. The first algorithm, Fully Loosely Concatenation (FLC), focuses on the independence of code development and the independence running with optimal performance of code. The second algorithm. Two Level Tightly Concatenation (TLTC), focuses on the optimal tradeoffs among above three objectives. Theoretical analyses for communicational complexity and parallel numerical experiments on hundreds of processors on two parallel machines have showed that these two algorithms are efficient and can be generalized to other multiphysics parallel numerical simulations. In especial, algorithm TLTC is linearly scalable and has achieved the optimal parallel performance. (authors)
Building Scalable Knowledge Graphs for Earth Science
Ramachandran, R.; Maskey, M.; Gatlin, P. N.; Zhang, J.; Duan, X.; Bugbee, K.; Christopher, S. A.; Miller, J. J.
2017-12-01
Estimates indicate that the world's information will grow by 800% in the next five years. In any given field, a single researcher or a team of researchers cannot keep up with this rate of knowledge expansion without the help of cognitive systems. Cognitive computing, defined as the use of information technology to augment human cognition, can help tackle large systemic problems. Knowledge graphs, one of the foundational components of cognitive systems, link key entities in a specific domain with other entities via relationships. Researchers could mine these graphs to make probabilistic recommendations and to infer new knowledge. At this point, however, there is a dearth of tools to generate scalable Knowledge graphs using existing corpus of scientific literature for Earth science research. Our project is currently developing an end-to-end automated methodology for incrementally constructing Knowledge graphs for Earth Science. Semantic Entity Recognition (SER) is one of the key steps in this methodology. SER for Earth Science uses external resources (including metadata catalogs and controlled vocabulary) as references to guide entity extraction and recognition (i.e., labeling) from unstructured text, in order to build a large training set to seed the subsequent auto-learning component in our algorithm. Results from several SER experiments will be presented as well as lessons learned.
CODA: A scalable, distributed data acquisition system
International Nuclear Information System (INIS)
Watson, W.A. III; Chen, J.; Heyes, G.; Jastrzembski, E.; Quarrie, D.
1994-01-01
A new data acquisition system has been designed for physics experiments scheduled to run at CEBAF starting in the summer of 1994. This system runs on Unix workstations connected via ethernet, FDDI, or other network hardware to multiple intelligent front end crates -- VME, CAMAC or FASTBUS. CAMAC crates may either contain intelligent processors, or may be interfaced to VME. The system is modular and scalable, from a single front end crate and one workstation linked by ethernet, to as may as 32 clusters of front end crates ultimately connected via a high speed network to a set of analysis workstations. The system includes an extensible, device independent slow controls package with drivers for CAMAC, VME, and high voltage crates, as well as a link to CEBAF accelerator controls. All distributed processes are managed by standard remote procedure calls propagating change-of-state requests, or reading and writing program variables. Custom components may be easily integrated. The system is portable to any front end processor running the VxWorks real-time kernel, and to most workstations supplying a few standard facilities such as rsh and X-windows, and Motif and socket libraries. Sample implementations exist for 2 Unix workstation families connected via ethernet or FDDI to VME (with interfaces to FASTBUS or CAMAC), and via ethernet to FASTBUS or CAMAC
Ancestors protocol for scalable key management
Directory of Open Access Journals (Sweden)
Dieter Gollmann
2010-06-01
Full Text Available Group key management is an important functional building block for secure multicast architecture. Thereby, it has been extensively studied in the literature. The main proposed protocol is Adaptive Clustering for Scalable Group Key Management (ASGK. According to ASGK protocol, the multicast group is divided into clusters, where each cluster consists of areas of members. Each cluster uses its own Traffic Encryption Key (TEK. These clusters are updated periodically depending on the dynamism of the members during the secure session. The modified protocol has been proposed based on ASGK with some modifications to balance the number of affected members and the encryption/decryption overhead with any number of the areas when a member joins or leaves the group. This modified protocol is called Ancestors protocol. According to Ancestors protocol, every area receives the dynamism of the members from its parents. The main objective of the modified protocol is to reduce the number of affected members during the leaving and joining members, then 1 affects n overhead would be reduced. A comparative study has been done between ASGK protocol and the modified protocol. According to the comparative results, it found that the modified protocol is always outperforming the ASGK protocol.
Scalable Combinatorial Tools for Health Disparities Research
Directory of Open Access Journals (Sweden)
Michael A. Langston
2014-10-01
Full Text Available Despite staggering investments made in unraveling the human genome, current estimates suggest that as much as 90% of the variance in cancer and chronic diseases can be attributed to factors outside an individual’s genetic endowment, particularly to environmental exposures experienced across his or her life course. New analytical approaches are clearly required as investigators turn to complicated systems theory and ecological, place-based and life-history perspectives in order to understand more clearly the relationships between social determinants, environmental exposures and health disparities. While traditional data analysis techniques remain foundational to health disparities research, they are easily overwhelmed by the ever-increasing size and heterogeneity of available data needed to illuminate latent gene x environment interactions. This has prompted the adaptation and application of scalable combinatorial methods, many from genome science research, to the study of population health. Most of these powerful tools are algorithmically sophisticated, highly automated and mathematically abstract. Their utility motivates the main theme of this paper, which is to describe real applications of innovative transdisciplinary models and analyses in an effort to help move the research community closer toward identifying the causal mechanisms and associated environmental contexts underlying health disparities. The public health exposome is used as a contemporary focus for addressing the complex nature of this subject.
Scalable Notch Antenna System for Multiport Applications
Directory of Open Access Journals (Sweden)
Abdurrahim Toktas
2016-01-01
Full Text Available A novel and compact scalable antenna system is designed for multiport applications. The basic design is built on a square patch with an electrical size of 0.82λ0×0.82λ0 (at 2.4 GHz on a dielectric substrate. The design consists of four symmetrical and orthogonal triangular notches with circular feeding slots at the corners of the common patch. The 4-port antenna can be simply rearranged to 8-port and 12-port systems. The operating band of the system can be tuned by scaling (S the size of the system while fixing the thickness of the substrate. The antenna system with S: 1/1 in size of 103.5×103.5 mm2 operates at the frequency band of 2.3–3.0 GHz. By scaling the antenna with S: 1/2.3, a system of 45×45 mm2 is achieved, and thus the operating band is tuned to 4.7–6.1 GHz with the same scattering characteristic. A parametric study is also conducted to investigate the effects of changing the notch dimensions. The performance of the antenna is verified in terms of the antenna characteristics as well as diversity and multiplexing parameters. The antenna system can be tuned by scaling so that it is applicable to the multiport WLAN, WIMAX, and LTE devices with port upgradability.
A Programmable, Scalable-Throughput Interleaver
Directory of Open Access Journals (Sweden)
E. J. C. Rijshouwer
2010-01-01
Full Text Available The interleaver stages of digital communication standards show a surprisingly large variation in throughput, state sizes, and permutation functions. Furthermore, data rates for 4G standards such as LTE-Advanced will exceed typical baseband clock frequencies of handheld devices. Multistream operation for Software Defined Radio and iterative decoding algorithms will call for ever higher interleave data rates. Our interleave machine is built around 8 single-port SRAM banks and can be programmed to generate up to 8 addresses every clock cycle. The scalable architecture combines SIMD and VLIW concepts with an efficient resolution of bank conflicts. A wide range of cellular, connectivity, and broadcast interleavers have been mapped on this machine, with throughputs up to more than 0.5 Gsymbol/second. Although it was designed for channel interleaving, the application domain of the interleaver extends also to Turbo interleaving. The presented configuration of the architecture is designed as a part of a programmable outer receiver on a prototype board. It offers (near universal programmability to enable the implementation of new interleavers. The interleaver measures 2.09 mm2 in 65 nm CMOS (including memories and proves functional on silicon.
A Programmable, Scalable-Throughput Interleaver
Directory of Open Access Journals (Sweden)
Rijshouwer EJC
2010-01-01
Full Text Available The interleaver stages of digital communication standards show a surprisingly large variation in throughput, state sizes, and permutation functions. Furthermore, data rates for 4G standards such as LTE-Advanced will exceed typical baseband clock frequencies of handheld devices. Multistream operation for Software Defined Radio and iterative decoding algorithms will call for ever higher interleave data rates. Our interleave machine is built around 8 single-port SRAM banks and can be programmed to generate up to 8 addresses every clock cycle. The scalable architecture combines SIMD and VLIW concepts with an efficient resolution of bank conflicts. A wide range of cellular, connectivity, and broadcast interleavers have been mapped on this machine, with throughputs up to more than 0.5 Gsymbol/second. Although it was designed for channel interleaving, the application domain of the interleaver extends also to Turbo interleaving. The presented configuration of the architecture is designed as a part of a programmable outer receiver on a prototype board. It offers (near universal programmability to enable the implementation of new interleavers. The interleaver measures 2.09 m in 65 nm CMOS (including memories and proves functional on silicon.
A parallel algorithm for 3D particle tracking and Lagrangian trajectory reconstruction
International Nuclear Information System (INIS)
Barker, Douglas; Zhang, Yuanhui; Lifflander, Jonathan; Arya, Anshu
2012-01-01
Particle-tracking methods are widely used in fluid mechanics and multi-target tracking research because of their unique ability to reconstruct long trajectories with high spatial and temporal resolution. Researchers have recently demonstrated 3D tracking of several objects in real time, but as the number of objects is increased, real-time tracking becomes impossible due to data transfer and processing bottlenecks. This problem may be solved by using parallel processing. In this paper, a parallel-processing framework has been developed based on frame decomposition and is programmed using the asynchronous object-oriented Charm++ paradigm. This framework can be a key step in achieving a scalable Lagrangian measurement system for particle-tracking velocimetry and may lead to real-time measurement capabilities. The parallel tracking algorithm was evaluated with three data sets including the particle image velocimetry standard 3D images data set #352, a uniform data set for optimal parallel performance and a computational-fluid-dynamics-generated non-uniform data set to test trajectory reconstruction accuracy, consistency with the sequential version and scalability to more than 500 processors. The algorithm showed strong scaling up to 512 processors and no inherent limits of scalability were seen. Ultimately, up to a 200-fold speedup is observed compared to the serial algorithm when 256 processors were used. The parallel algorithm is adaptable and could be easily modified to use any sequential tracking algorithm, which inputs frames of 3D particle location data and outputs particle trajectories
Zhang, Xiaoyuan
2011-01-01
The combined use of brush anodes and glass fiber (GF1) separators, and plastic mesh supporters were used here for the first time to create a scalable microbial fuel cell architecture. Separators prevented short circuiting of closely-spaced electrodes, and cathode supporters were used to avoid water gaps between the separator and cathode that can reduce power production. The maximum power density with a separator and supporter and a single cathode was 75±1W/m3. Removing the separator decreased power by 8%. Adding a second cathode increased power to 154±1W/m3. Current was increased by connecting two MFCs connected in parallel. These results show that brush anodes, combined with a glass fiber separator and a plastic mesh supporter, produce a useful MFC architecture that is inherently scalable due to good insulation between the electrodes and a compact architecture. © 2010 Elsevier Ltd.
Woźniak, M.
2016-06-02
We study the features of a new mixed integration scheme dedicated to solving the non-stationary variational problems. The scheme is composed of the FEM approximation with respect to the space variable coupled with a 3-leveled time integration scheme with a linearized right-hand side operator. It was applied in solving the Cahn-Hilliard parabolic equation with a nonlinear, fourth-order elliptic part. The second order of the approximation along the time variable was proven. Moreover, the good scalability of the software based on this scheme was confirmed during simulations. We verify the proposed time integration scheme by monitoring the Ginzburg-Landau free energy. The numerical simulations are performed by using a parallel multi-frontal direct solver executed over STAMPEDE Linux cluster. Its scalability was compared to the results of the three direct solvers, including MUMPS, SuperLU and PaSTiX.
Heterogeneous scalable framework for multiphase flows
Energy Technology Data Exchange (ETDEWEB)
Morris, Karla Vanessa [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2013-09-01
Two categories of challenges confront the developer of computational spray models: those related to the computation and those related to the physics. Regarding the computation, the trend towards heterogeneous, multi- and many-core platforms will require considerable re-engineering of codes written for the current supercomputing platforms. Regarding the physics, accurate methods for transferring mass, momentum and energy from the dispersed phase onto the carrier fluid grid have so far eluded modelers. Significant challenges also lie at the intersection between these two categories. To be competitive, any physics model must be expressible in a parallel algorithm that performs well on evolving computer platforms. This work created an application based on a software architecture where the physics and software concerns are separated in a way that adds flexibility to both. The develop spray-tracking package includes an application programming interface (API) that abstracts away the platform-dependent parallelization concerns, enabling the scientific programmer to write serial code that the API resolves into parallel processes and threads of execution. The project also developed the infrastructure required to provide similar APIs to other application. The API allow object-oriented Fortran applications direct interaction with Trilinos to support memory management of distributed objects in central processing units (CPU) and graphic processing units (GPU) nodes for applications using C++.
Scalable IC Platform for Smart Cameras
Directory of Open Access Journals (Sweden)
Harry Broers
2005-08-01
Full Text Available Smart cameras are among the emerging new fields of electronics. The points of interest are in the application areas, software and IC development. In order to reduce cost, it is worthwhile to invest in a single architecture that can be scaled for the various application areas in performance (and resulting power consumption. In this paper, we show that the combination of an SIMD (single-instruction multiple-data processor and a general-purpose DSP is very advantageous for the image processing tasks encountered in smart cameras. While the SIMD processor gives the very high performance necessary by exploiting the inherent data parallelism found in the pixel crunching part of the algorithms, the DSP offers a friendly approach to the more complex tasks. The paper continues to motivate that SIMD processors have very convenient scaling properties in silicon, making the complete, SIMD-DSP architecture suitable for different application areas without changing the software suite. Analysis of the changes in power consumption due to scaling shows that for typical image processing tasks, it is beneficial to scale the SIMD processor to use the maximum level of parallelism available in the algorithm if the IC supply voltage can be lowered. If silicon cost is of importance, the parallelism of the processor should be scaled to just reach the desired performance given the speed of the silicon.
Scalable Computational Chemistry: New Developments and Applications
Energy Technology Data Exchange (ETDEWEB)
Alexeev, Yuri [Iowa State Univ., Ames, IA (United States)
2002-01-01
The computational part of the thesis is the investigation of titanium chloride (II) as a potential catalyst for the bis-silylation reaction of ethylene with hexaclorodisilane at different levels of theory. Bis-silylation is an important reaction for producing bis(silyl) compounds and new C-Si bonds, which can serve as monomers for silicon containing polymers and silicon carbides. Ab initio calculations on the steps involved in a proposed mechanism are presented. This choice of reactants allows them to study this reaction at reliable levels of theory without compromising accuracy. The calculations indicate that this is a highly exothermic barrierless reaction. The TiCl_{2} catalyst removes a 50 kcal/mol activation energy barrier required for the reaction without the catalyst. The first step is interaction of TiCl_{2} with ethylene to form an intermediate that is 60 kcal/mol below the energy of the reactants. This is the driving force for the entire reaction. Dynamic correlation plays a significant role because RHF calculations indicate that the net barrier for the catalyzed reaction is 50 kcal/mol. They conclude that divalent Ti has the potential to become an important industrial catalyst for silylation reactions. In the programming part of the thesis, parallelization of different quantum chemistry methods is presented. The parallelization of code is becoming important aspects of quantum chemistry code development. Two trends contribute to it: the overall desire to study large chemical systems and the desire to employ highly correlated methods which are usually computationally and memory expensive. In the presented distributed data algorithms computation is parallelized and the largest arrays are evenly distributed among CPUs. First, the parallelization of the Hartree-Fock self-consistent field (SCF) method is considered. SCF method is the most common starting point for more accurate calculations. The Fock build (sub step of SCF) from AO integrals is
Algorithms for parallel computers
International Nuclear Information System (INIS)
Churchhouse, R.F.
1985-01-01
Until relatively recently almost all the algorithms for use on computers had been designed on the (usually unstated) assumption that they were to be run on single processor, serial machines. With the introduction of vector processors, array processors and interconnected systems of mainframes, minis and micros, however, various forms of parallelism have become available. The advantage of parallelism is that it offers increased overall processing speed but it also raises some fundamental questions, including: (i) which, if any, of the existing 'serial' algorithms can be adapted for use in the parallel mode. (ii) How close to optimal can such adapted algorithms be and, where relevant, what are the convergence criteria. (iii) How can we design new algorithms specifically for parallel systems. (iv) For multi-processor systems how can we handle the software aspects of the interprocessor communications. Aspects of these questions illustrated by examples are considered in these lectures. (orig.)
Parallelism and array processing
International Nuclear Information System (INIS)
Zacharov, V.
1983-01-01
Modern computing, as well as the historical development of computing, has been dominated by sequential monoprocessing. Yet there is the alternative of parallelism, where several processes may be in concurrent execution. This alternative is discussed in a series of lectures, in which the main developments involving parallelism are considered, both from the standpoint of computing systems and that of applications that can exploit such systems. The lectures seek to discuss parallelism in a historical context, and to identify all the main aspects of concurrency in computation right up to the present time. Included will be consideration of the important question as to what use parallelism might be in the field of data processing. (orig.)
Parallel processing implementation for the coupled transport of photons and electrons using OpenMP
Doerner, Edgardo
2016-05-01
In this work the use of OpenMP to implement the parallel processing of the Monte Carlo (MC) simulation of the coupled transport for photons and electrons is presented. This implementation was carried out using a modified EGSnrc platform which enables the use of the Microsoft Visual Studio 2013 (VS2013) environment, together with the developing tools available in the Intel Parallel Studio XE 2015 (XE2015). The performance study of this new implementation was carried out in a desktop PC with a multi-core CPU, taking as a reference the performance of the original platform. The results were satisfactory, both in terms of scalability as parallelization efficiency.
A Testbed for Highly-Scalable Mission Critical Information Systems
National Research Council Canada - National Science Library
Birman, Kenneth P
2005-01-01
... systems in a networked environment. Headed by Professor Ken Birman, the project is exploring a novel fusion of classical protocols for reliable multicast communication with a new style of peer-to-peer protocol called scalable "gossip...
Scalable Partitioning Algorithms for FPGAs With Heterogeneous Resources
National Research Council Canada - National Science Library
Selvakkumaran, Navaratnasothie; Ranjan, Abhishek; Raje, Salil; Karypis, George
2004-01-01
As FPGA densities increase, partitioning-based FPGA placement approaches are becoming increasingly important as they can be used to provide high-quality and computationally scalable placement solutions...
Parallel magnetic resonance imaging
International Nuclear Information System (INIS)
Larkman, David J; Nunes, Rita G
2007-01-01
Parallel imaging has been the single biggest innovation in magnetic resonance imaging in the last decade. The use of multiple receiver coils to augment the time consuming Fourier encoding has reduced acquisition times significantly. This increase in speed comes at a time when other approaches to acquisition time reduction were reaching engineering and human limits. A brief summary of spatial encoding in MRI is followed by an introduction to the problem parallel imaging is designed to solve. There are a large number of parallel reconstruction algorithms; this article reviews a cross-section, SENSE, SMASH, g-SMASH and GRAPPA, selected to demonstrate the different approaches. Theoretical (the g-factor) and practical (coil design) limits to acquisition speed are reviewed. The practical implementation of parallel imaging is also discussed, in particular coil calibration. How to recognize potential failure modes and their associated artefacts are shown. Well-established applications including angiography, cardiac imaging and applications using echo planar imaging are reviewed and we discuss what makes a good application for parallel imaging. Finally, active research areas where parallel imaging is being used to improve data quality by repairing artefacted images are also reviewed. (invited topical review)
SOL: A Library for Scalable Online Learning Algorithms
Wu, Yue; Hoi, Steven C. H.; Liu, Chenghao; Lu, Jing; Sahoo, Doyen; Yu, Nenghai
2016-01-01
SOL is an open-source library for scalable online learning algorithms, and is particularly suitable for learning with high-dimensional data. The library provides a family of regular and sparse online learning algorithms for large-scale binary and multi-class classification tasks with high efficiency, scalability, portability, and extensibility. SOL was implemented in C++, and provided with a collection of easy-to-use command-line tools, python wrappers and library calls for users and develope...
Modular Universal Scalable Ion-trap Quantum Computer
2016-06-02
SECURITY CLASSIFICATION OF: The main goal of the original MUSIQC proposal was to construct and demonstrate a modular and universally- expandable ion...Distribution Unlimited UU UU UU UU 02-06-2016 1-Aug-2010 31-Jan-2016 Final Report: Modular Universal Scalable Ion-trap Quantum Computer The views...P.O. Box 12211 Research Triangle Park, NC 27709-2211 Ion trap quantum computation, scalable modular architectures REPORT DOCUMENTATION PAGE 11
Architectures and Applications for Scalable Quantum Information Systems
2007-01-01
Gershenfeld and I. Chuang. Quantum computing with molecules. Scientific American, June 1998. [16] A. Globus, D. Bailey, J. Han, R. Jaffe, C. Levit , R...AFRL-IF-RS-TR-2007-12 Final Technical Report January 2007 ARCHITECTURES AND APPLICATIONS FOR SCALABLE QUANTUM INFORMATION SYSTEMS...NUMBER 5b. GRANT NUMBER FA8750-01-2-0521 4. TITLE AND SUBTITLE ARCHITECTURES AND APPLICATIONS FOR SCALABLE QUANTUM INFORMATION SYSTEMS 5c
On the scalability of LISP and advanced overlaid services
Coras, Florin
2015-01-01
In just four decades the Internet has gone from a lab experiment to a worldwide, business critical infrastructure that caters to the communication needs of almost a half of the Earth's population. With these figures on its side, arguing against the Internet's scalability would seem rather unwise. However, the Internet's organic growth is far from finished and, as billions of new devices are expected to be joined in the not so distant future, scalability, or lack thereof, is commonly believed ...
Scalable, full-colour and controllable chromotropic plasmonic printing
Xue, Jiancai; Zhou, Zhang-Kai; Wei, Zhiqiang; Su, Rongbin; Lai, Juan; Li, Juntao; Li, Chao; Zhang, Tengwei; Wang, Xue-Hua
2015-01-01
Plasmonic colour printing has drawn wide attention as a promising candidate for the next-generation colour-printing technology. However, an efficient approach to realize full colour and scalable fabrication is still lacking, which prevents plasmonic colour printing from practical applications. Here we present a scalable and full-colour plasmonic printing approach by combining conjugate twin-phase modulation with a plasmonic broadband absorber. More importantly, our approach also demonstrates ...
Responsive, Flexible and Scalable Broader Impacts (Invited)
Decharon, A.; Companion, C.; Steinman, M.
2010-12-01
In many educator professional development workshops, scientists present content in a slideshow-type format and field questions afterwards. Drawbacks of this approach include: inability to begin the lecture with content that is responsive to audience needs; lack of flexible access to specific material within the linear presentation; and “Q&A” sessions are not easily scalable to broader audiences. Often this type of traditional interaction provides little direct benefit to the scientists. The Centers for Ocean Sciences Education Excellence - Ocean Systems (COSEE-OS) applies the technique of concept mapping with demonstrated effectiveness in helping scientists and educators “get on the same page” (deCharon et al., 2009). A key aspect is scientist professional development geared towards improving face-to-face and online communication with non-scientists. COSEE-OS promotes scientist-educator collaboration, tests the application of scientist-educator maps in new contexts through webinars, and is piloting the expansion of maps as long-lived resources for the broader community. Collaboration - COSEE-OS has developed and tested a workshop model bringing scientists and educators together in a peer-oriented process, often clarifying common misconceptions. Scientist-educator teams develop online concept maps that are hyperlinked to “assets” (i.e., images, videos, news) and are responsive to the needs of non-scientist audiences. In workshop evaluations, 91% of educators said that the process of concept mapping helped them think through science topics and 89% said that concept mapping helped build a bridge of communication with scientists (n=53). Application - After developing a concept map, with COSEE-OS staff assistance, scientists are invited to give webinar presentations that include live “Q&A” sessions. The webinars extend the reach of scientist-created concept maps to new contexts, both geographically and topically (e.g., oil spill), with a relatively small
A parallel solution for high resolution histological image analysis.
Bueno, G; González, R; Déniz, O; García-Rojo, M; González-García, J; Fernández-Carrobles, M M; Vállez, N; Salido, J
2012-10-01
This paper describes a general methodology for developing parallel image processing algorithms based on message passing for high resolution images (on the order of several Gigabytes). These algorithms have been applied to histological images and must be executed on massively parallel processing architectures. Advances in new technologies for complete slide digitalization in pathology have been combined with developments in biomedical informatics. However, the efficient use of these digital slide systems is still a challenge. The image processing that these slides are subject to is still limited both in terms of data processed and processing methods. The work presented here focuses on the need to design and develop parallel image processing tools capable of obtaining and analyzing the entire gamut of information included in digital slides. Tools have been developed to assist pathologists in image analysis and diagnosis, and they cover low and high-level image processing methods applied to histological images. Code portability, reusability and scalability have been tested by using the following parallel computing architectures: distributed memory with massive parallel processors and two networks, INFINIBAND and Myrinet, composed of 17 and 1024 nodes respectively. The parallel framework proposed is flexible, high performance solution and it shows that the efficient processing of digital microscopic images is possible and may offer important benefits to pathology laboratories. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Microscopic Characterization of Scalable Coherent Rydberg Superatoms
Directory of Open Access Journals (Sweden)
Johannes Zeiher
2015-08-01
Full Text Available Strong interactions can amplify quantum effects such that they become important on macroscopic scales. Controlling these coherently on a single-particle level is essential for the tailored preparation of strongly correlated quantum systems and opens up new prospects for quantum technologies. Rydberg atoms offer such strong interactions, which lead to extreme nonlinearities in laser-coupled atomic ensembles. As a result, multiple excitation of a micrometer-sized cloud can be blocked while the light-matter coupling becomes collectively enhanced. The resulting two-level system, often called a “superatom,” is a valuable resource for quantum information, providing a collective qubit. Here, we report on the preparation of 2 orders of magnitude scalable superatoms utilizing the large interaction strength provided by Rydberg atoms combined with precise control of an ensemble of ultracold atoms in an optical lattice. The latter is achieved with sub-shot-noise precision by local manipulation of a two-dimensional Mott insulator. We microscopically confirm the superatom picture by in situ detection of the Rydberg excitations and observe the characteristic square-root scaling of the optical coupling with the number of atoms. Enabled by the full control over the atomic sample, including the motional degrees of freedom, we infer the overlap of the produced many-body state with a W state from the observed Rabi oscillations and deduce the presence of entanglement. Finally, we investigate the breakdown of the superatom picture when two Rydberg excitations are present in the system, which leads to dephasing and a loss of coherence.
Integration experiences and performance studies of A COTS parallel archive systems
Energy Technology Data Exchange (ETDEWEB)
Chen, Hsing-bung [Los Alamos National Laboratory; Scott, Cody [Los Alamos National Laboratory; Grider, Bary [Los Alamos National Laboratory; Torres, Aaron [Los Alamos National Laboratory; Turley, Milton [Los Alamos National Laboratory; Sanchez, Kathy [Los Alamos National Laboratory; Bremer, John [Los Alamos National Laboratory
2010-01-01
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf(COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, ls, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petaflop/s computing system, LANL's Roadrunner, and demonstrated its capability to address requirements of
Integration experiments and performance studies of a COTS parallel archive system
Energy Technology Data Exchange (ETDEWEB)
Chen, Hsing-bung [Los Alamos National Laboratory; Scott, Cody [Los Alamos National Laboratory; Grider, Gary [Los Alamos National Laboratory; Torres, Aaron [Los Alamos National Laboratory; Turley, Milton [Los Alamos National Laboratory; Sanchez, Kathy [Los Alamos National Laboratory; Bremer, John [Los Alamos National Laboratory
2010-06-16
Current and future Archive Storage Systems have been asked to (a) scale to very high bandwidths, (b) scale in metadata performance, (c) support policy-based hierarchical storage management capability, (d) scale in supporting changing needs of very large data sets, (e) support standard interface, and (f) utilize commercial-off-the-shelf (COTS) hardware. Parallel file systems have been asked to do the same thing but at one or more orders of magnitude faster in performance. Archive systems continue to move closer to file systems in their design due to the need for speed and bandwidth, especially metadata searching speeds such as more caching and less robust semantics. Currently the number of extreme highly scalable parallel archive solutions is very small especially those that will move a single large striped parallel disk file onto many tapes in parallel. We believe that a hybrid storage approach of using COTS components and innovative software technology can bring new capabilities into a production environment for the HPC community much faster than the approach of creating and maintaining a complete end-to-end unique parallel archive software solution. In this paper, we relay our experience of integrating a global parallel file system and a standard backup/archive product with a very small amount of additional code to provide a scalable, parallel archive. Our solution has a high degree of overlap with current parallel archive products including (a) doing parallel movement to/from tape for a single large parallel file, (b) hierarchical storage management, (c) ILM features, (d) high volume (non-single parallel file) archives for backup/archive/content management, and (e) leveraging all free file movement tools in Linux such as copy, move, Is, tar, etc. We have successfully applied our working COTS Parallel Archive System to the current world's first petafiop/s computing system, LANL's Roadrunner machine, and demonstrated its capability to address
Empirical study of parallel LRU simulation algorithms
Carr, Eric; Nicol, David M.
1994-01-01
This paper reports on the performance of five parallel algorithms for simulating a fully associative cache operating under the LRU (Least-Recently-Used) replacement policy. Three of the algorithms are SIMD, and are implemented on the MasPar MP-2 architecture. Two other algorithms are parallelizations of an efficient serial algorithm on the Intel Paragon. One SIMD algorithm is quite simple, but its cost is linear in the cache size. The two other SIMD algorithm are more complex, but have costs that are independent on the cache size. Both the second and third SIMD algorithms compute all stack distances; the second SIMD algorithm is completely general, whereas the third SIMD algorithm presumes and takes advantage of bounds on the range of reference tags. Both MIMD algorithm implemented on the Paragon are general and compute all stack distances; they differ in one step that may affect their respective scalability. We assess the strengths and weaknesses of these algorithms as a function of problem size and characteristics, and compare their performance on traces derived from execution of three SPEC benchmark programs.
Directory of Open Access Journals (Sweden)
Thomas E. Rosmond
2000-01-01
Full Text Available The Navy Operational Global Atmospheric Prediction System (NOGAPS includes a state-of-the-art spectral forecast model similar to models run at several major operational numerical weather prediction (NWP centers around the world. The model, developed by the Naval Research Laboratory (NRL in Monterey, California, has run operational at the Fleet Numerical Meteorological and Oceanographic Center (FNMOC since 1982, and most recently is being run on a Cray C90 in a multi-tasked configuration. Typically the multi-tasked code runs on 10 to 15 processors with overall parallel efficiency of about 90%. resolution is T159L30, but other operational and research applications run at significantly lower resolutions. A scalable NOGAPS forecast model has been developed by NRL in anticipation of a FNMOC C90 replacement in about 2001, as well as for current NOGAPS research requirements to run on DOD High-Performance Computing (HPC scalable systems. The model is designed to run with message passing (MPI. Model design criteria include bit reproducibility for different processor numbers and reasonably efficient performance on fully shared memory, distributed memory, and distributed shared memory systems for a wide range of model resolutions. Results for a wide range of processor numbers, model resolutions, and different vendor architectures are presented. Single node performance has been disappointing on RISC based systems, at least compared to vector processor performance. This is a common complaint, and will require careful re-examination of traditional numerical weather prediction (NWP model software design and data organization to fully exploit future scalable architectures.
Parallel computing solution of Boltzmann neutron transport equation
International Nuclear Information System (INIS)
Ansah-Narh, T.
2010-01-01
The focus of the research was on developing parallel computing algorithm for solving Eigen-values of the Boltzmam Neutron Transport Equation (BNTE) in a slab geometry using multi-grid approach. In response to the problem of slow execution of serial computing when solving large problems, such as BNTE, the study was focused on the design of parallel computing systems which was an evolution of serial computing that used multiple processing elements simultaneously to solve complex physical and mathematical problems. Finite element method (FEM) was used for the spatial discretization scheme, while angular discretization was accomplished by expanding the angular dependence in terms of Legendre polynomials. The eigenvalues representing the multiplication factors in the BNTE were determined by the power method. MATLAB Compiler Version 4.1 (R2009a) was used to compile the MATLAB codes of BNTE. The implemented parallel algorithms were enabled with matlabpool, a Parallel Computing Toolbox function. The option UseParallel was set to 'always' and the default value of the option was 'never'. When those conditions held, the solvers computed estimated gradients in parallel. The parallel computing system was used to handle all the bottlenecks in the matrix generated from the finite element scheme and each domain of the power method generated. The parallel algorithm was implemented on a Symmetric Multi Processor (SMP) cluster machine, which had Intel 32 bit quad-core x 86 processors. Convergence rates and timings for the algorithm on the SMP cluster machine were obtained. Numerical experiments indicated the designed parallel algorithm could reach perfect speedup and had good stability and scalability. (au)
The STAPL Parallel Graph Library
Harshvardhan,; Fidel, Adam; Amato, Nancy M.; Rauchwerger, Lawrence
2013-01-01
This paper describes the stapl Parallel Graph Library, a high-level framework that abstracts the user from data-distribution and parallelism details and allows them to concentrate on parallel graph algorithm development. It includes a customizable
Evaluating parallel relational databases for medical data analysis.
Energy Technology Data Exchange (ETDEWEB)
Rintoul, Mark Daniel; Wilson, Andrew T.
2012-03-01
Hospitals have always generated and consumed large amounts of data concerning patients, treatment and outcomes. As computers and networks have permeated the hospital environment it has become feasible to collect and organize all of this data. This raises naturally the question of how to deal with the resulting mountain of information. In this report we detail a proof-of-concept test using two commercially available parallel database systems to analyze a set of real, de-identified medical records. We examine database scalability as data sizes increase as well as responsiveness under load from multiple users.
Scalable real space pseudopotential density functional codes for materials in the exascale regime
Lena, Charles; Chelikowsky, James; Schofield, Grady; Biller, Ariel; Kronik, Leeor; Saad, Yousef; Deslippe, Jack
Real-space pseudopotential density functional theory has proven to be an efficient method for computing the properties of matter in many different states and geometries, including liquids, wires, slabs, and clusters with and without spin polarization. Fully self-consistent solutions using this approach have been routinely obtained for systems with thousands of atoms. Yet, there are many systems of notable larger sizes where quantum mechanical accuracy is desired, but scalability proves to be a hindrance. Such systems include large biological molecules, complex nanostructures, or mismatched interfaces. We will present an overview of our new massively parallel algorithms, which offer improved scalability in preparation for exascale supercomputing. We will illustrate these algorithms by considering the electronic structure of a Si nanocrystal exceeding 104 atoms. Support provided by the SciDAC program, Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences. Grant Numbers DE-SC0008877 (Austin) and DE-FG02-12ER4 (Berkeley).
Space-Filling Supercapacitor Carpets: Highly scalable fractal architecture for energy storage
Tiliakos, Athanasios; Trefilov, Alexandra M. I.; Tanasǎ, Eugenia; Balan, Adriana; Stamatin, Ioan
2018-04-01
Revamping ground-breaking ideas from fractal geometry, we propose an alternative micro-supercapacitor configuration realized by laser-induced graphene (LIG) foams produced via laser pyrolysis of inexpensive commercial polymers. The Space-Filling Supercapacitor Carpet (SFSC) architecture introduces the concept of nested electrodes based on the pre-fractal Peano space-filling curve, arranged in a symmetrical equilateral setup that incorporates multiple parallel capacitor cells sharing common electrodes for maximum efficiency and optimal length-to-area distribution. We elucidate on the theoretical foundations of the SFSC architecture, and we introduce innovations (high-resolution vector-mode printing) in the LIG method that allow for the realization of flexible and scalable devices based on low iterations of the Peano algorithm. SFSCs exhibit distributed capacitance properties, leading to capacitance, energy, and power ratings proportional to the number of nested electrodes (up to 4.3 mF, 0.4 μWh, and 0.2 mW for the largest tested model of low iteration using aqueous electrolytes), with competitively high energy and power densities. This can pave the road for full scalability in energy storage, reaching beyond the scale of micro-supercapacitors for incorporating into larger and more demanding applications.
Scalable electrophysiology in intact small animals with nanoscale suspended electrode arrays
Gonzales, Daniel L.; Badhiwala, Krishna N.; Vercosa, Daniel G.; Avants, Benjamin W.; Liu, Zheng; Zhong, Weiwei; Robinson, Jacob T.
2017-07-01
Electrical measurements from large populations of animals would help reveal fundamental properties of the nervous system and neurological diseases. Small invertebrates are ideal for these large-scale studies; however, patch-clamp electrophysiology in microscopic animals typically requires invasive dissections and is low-throughput. To overcome these limitations, we present nano-SPEARs: suspended electrodes integrated into a scalable microfluidic device. Using this technology, we have made the first extracellular recordings of body-wall muscle electrophysiology inside an intact roundworm, Caenorhabditis elegans. We can also use nano-SPEARs to record from multiple animals in parallel and even from other species, such as Hydra littoralis. Furthermore, we use nano-SPEARs to establish the first electrophysiological phenotypes for C. elegans models for amyotrophic lateral sclerosis and Parkinson's disease, and show a partial rescue of the Parkinson's phenotype through drug treatment. These results demonstrate that nano-SPEARs provide the core technology for microchips that enable scalable, in vivo studies of neurobiology and neurological diseases.
Miao, Wang; Luo, Jun; Di Lucente, Stefano; Dorren, Harm; Calabretta, Nicola
2014-02-10
We propose and demonstrate an optical flat datacenter network based on scalable optical switch system with optical flow control. Modular structure with distributed control results in port-count independent optical switch reconfiguration time. RF tone in-band labeling technique allowing parallel processing of the label bits ensures the low latency operation regardless of the switch port-count. Hardware flow control is conducted at optical level by re-using the label wavelength without occupying extra bandwidth, space, and network resources which further improves the performance of latency within a simple structure. Dynamic switching including multicasting operation is validated for a 4 x 4 system. Error free operation of 40 Gb/s data packets has been achieved with only 1 dB penalty. The system could handle an input load up to 0.5 providing a packet loss lower that 10(-5) and an average latency less that 500 ns when a buffer size of 16 packets is employed. Investigation on scalability also indicates that the proposed system could potentially scale up to large port count with limited power penalty.
Scalable microbial fuel cell (MFC) stack for continuous real wastewater treatment.
Zhuang, Li; Zheng, Yu; Zhou, Shungui; Yuan, Yong; Yuan, Haoran; Chen, Yong
2012-02-01
A tubular air-cathode microbial fuel cell (MFC) stack with high scalability and low material cost was constructed and the ability of simultaneous real wastewater treatment and bioelectricity generation was investigated under continuous flow mode. At the two organic loading rates (ORLs) tested (1.2 and 4.9kg COD/m(3)d), five non-Pt MFCs connected in series and parallel circuit modes treating swine wastewater can enable an increase of the voltage and the current. The parallel stack retained high power output and the series connection underwent energy loss due to the substrate cross-conduction effect. With continuous electricity production, the parallel stack achieved 83.8% of COD removal and 90.8% of NH(4)(+)-N removal at 1.2kg COD/m(3)d, and 77.1% COD removal and 80.7% NH(4)(+)-N removal at 4.9kg COD/m(3)d. The MFC stack system in this study was demonstrated to be able to treat real wastewater with the added benefit of harvesting electricity energy. Copyright © 2011 Elsevier Ltd. All rights reserved.
Scalable quantum information processing with photons and atoms
Pan, Jian-Wei
Over the past three decades, the promises of super-fast quantum computing and secure quantum cryptography have spurred a world-wide interest in quantum information, generating fascinating quantum technologies for coherent manipulation of individual quantum systems. However, the distance of fiber-based quantum communications is limited due to intrinsic fiber loss and decreasing of entanglement quality. Moreover, probabilistic single-photon source and entanglement source demand exponentially increased overheads for scalable quantum information processing. To overcome these problems, we are taking two paths in parallel: quantum repeaters and through satellite. We used the decoy-state QKD protocol to close the loophole of imperfect photon source, and used the measurement-device-independent QKD protocol to close the loophole of imperfect photon detectors--two main loopholes in quantum cryptograph. Based on these techniques, we are now building world's biggest quantum secure communication backbone, from Beijing to Shanghai, with a distance exceeding 2000 km. Meanwhile, we are developing practically useful quantum repeaters that combine entanglement swapping, entanglement purification, and quantum memory for the ultra-long distance quantum communication. The second line is satellite-based global quantum communication, taking advantage of the negligible photon loss and decoherence in the atmosphere. We realized teleportation and entanglement distribution over 100 km, and later on a rapidly moving platform. We are also making efforts toward the generation of multiphoton entanglement and its use in teleportation of multiple properties of a single quantum particle, topological error correction, quantum algorithms for solving systems of linear equations and machine learning. Finally, I will talk about our recent experiments on quantum simulations on ultracold atoms. On the one hand, by applying an optical Raman lattice technique, we realized a two-dimensional spin-obit (SO
Laplacian embedded regression for scalable manifold regularization.
Chen, Lin; Tsang, Ivor W; Xu, Dong
2012-06-01
world data sets show the effectiveness and scalability of the proposed framework.
Massively parallel multicanonical simulations
Gross, Jonathan; Zierenberg, Johannes; Weigel, Martin; Janke, Wolfhard
2018-03-01
Generalized-ensemble Monte Carlo simulations such as the multicanonical method and similar techniques are among the most efficient approaches for simulations of systems undergoing discontinuous phase transitions or with rugged free-energy landscapes. As Markov chain methods, they are inherently serial computationally. It was demonstrated recently, however, that a combination of independent simulations that communicate weight updates at variable intervals allows for the efficient utilization of parallel computational resources for multicanonical simulations. Implementing this approach for the many-thread architecture provided by current generations of graphics processing units (GPUs), we show how it can be efficiently employed with of the order of 104 parallel walkers and beyond, thus constituting a versatile tool for Monte Carlo simulations in the era of massively parallel computing. We provide the fully documented source code for the approach applied to the paradigmatic example of the two-dimensional Ising model as starting point and reference for practitioners in the field.
A novel two-level dynamic parallel data scheme for large 3-D SN calculations
International Nuclear Information System (INIS)
Sjoden, G.E.; Shedlock, D.; Haghighat, A.; Yi, C.
2005-01-01
We introduce a new dynamic parallel memory optimization scheme for executing large scale 3-D discrete ordinates (Sn) simulations on distributed memory parallel computers. In order for parallel transport codes to be truly scalable, they must use parallel data storage, where only the variables that are locally computed are locally stored. Even with parallel data storage for the angular variables, cumulative storage requirements for large discrete ordinates calculations can be prohibitive. To address this problem, Memory Tuning has been implemented into the PENTRAN 3-D parallel discrete ordinates code as an optimized, two-level ('large' array, 'small' array) parallel data storage scheme. Memory Tuning can be described as the process of parallel data memory optimization. Memory Tuning dynamically minimizes the amount of required parallel data in allocated memory on each processor using a statistical sampling algorithm. This algorithm is based on the integral average and standard deviation of the number of fine meshes contained in each coarse mesh in the global problem. Because PENTRAN only stores the locally computed problem phase space, optimal two-level memory assignments can be unique on each node, depending upon the parallel decomposition used (hybrid combinations of angular, energy, or spatial). As demonstrated in the two large discrete ordinates models presented (a storage cask and an OECD MOX Benchmark), Memory Tuning can save a substantial amount of memory per parallel processor, allowing one to accomplish very large scale Sn computations. (authors)
Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage
Energy Technology Data Exchange (ETDEWEB)
Kim, Jungwon [ORNL; Sajjapongse, Kittisak [ORNL; Lee, Seyong [ORNL; Vetter, Jeffrey S [ORNL
2017-01-01
A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.
Parallel programming with Python
Palach, Jan
2014-01-01
A fast, easy-to-follow and clear tutorial to help you develop Parallel computing systems using Python. Along with explaining the fundamentals, the book will also introduce you to slightly advanced concepts and will help you in implementing these techniques in the real world. If you are an experienced Python programmer and are willing to utilize the available computing resources by parallelizing applications in a simple way, then this book is for you. You are required to have a basic knowledge of Python development to get the most of this book.
OceanXtremes: Scalable Anomaly Detection in Oceanographic Time-Series
Wilson, B. D.; Armstrong, E. M.; Chin, T. M.; Gill, K. M.; Greguska, F. R., III; Huang, T.; Jacob, J. C.; Quach, N.
2016-12-01
The oceanographic community must meet the challenge to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we are developing an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic Cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of 15 to 30-year ocean science datasets.Our parallel analytics engine is extending the NEXUS system and exploits multiple open-source technologies: Apache Cassandra as a distributed spatial "tile" cache, Apache Spark for in-memory parallel computation, and Apache Solr for spatial search and storing pre-computed tile statistics and other metadata. OceanXtremes provides these key capabilities: Parallel generation (Spark on a compute cluster) of 15 to 30-year Ocean Climatologies (e.g. sea surface temperature or SST) in hours or overnight, using simple pixel averages or customizable Gaussian-weighted "smoothing" over latitude, longitude, and time; Parallel pre-computation, tiling, and caching of anomaly fields (daily variables minus a chosen climatology) with pre-computed tile statistics; Parallel detection (over the time-series of tiles) of anomalies or phenomena by regional area-averages exceeding a specified threshold (e.g. high SST in El Nino or SST "blob" regions), or more complex, custom data mining algorithms; Shared discovery and exploration of ocean phenomena and anomalies (facet search using Solr), along with unexpected correlations between key measured variables; Scalable execution for all capabilities on a hybrid Cloud, using our on-premise OpenStack Cloud cluster or at Amazon. The key idea is that the parallel data-mining operations will be run "near" the ocean data archives (a local "network" hop) so that we can efficiently access the thousands of files making up a three decade time
Energy Technology Data Exchange (ETDEWEB)
Jain, Atul K. [Univ. of Illinois, Urbana-Champaign, IL (United States)
2016-09-14
The overall objectives of this DOE funded project is to combine scientific and computational challenges in climate modeling by expanding our understanding of the biogeophysical-biogeochemical processes and their interactions in the northern high latitudes (NHLs) using an earth system modeling (ESM) approach, and by adopting an adaptive parallel runtime system in an ESM to achieve efficient and scalable climate simulations through improved load balancing algorithms.
Scalable Track Detection in SAR CCD Images
Energy Technology Data Exchange (ETDEWEB)
Chow, James G [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Quach, Tu-Thach [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
2017-03-01
Existing methods to detect vehicle tracks in coherent change detection images, a product of combining two synthetic aperture radar images ta ken at different times of the same scene, rely on simple, fast models to label track pixels. These models, however, are often too simple to capture natural track features such as continuity and parallelism. We present a simple convolutional network architecture consisting of a series of 3-by-3 convolutions to detect tracks. The network is trained end-to-end to learn natural track features entirely from data. The network is computationally efficient and improves the F-score on a standard dataset to 0.988, up fr om 0.907 obtained by the current state-of-the-art method.
geoKepler Workflow Module for Computationally Scalable and Reproducible Geoprocessing and Modeling
Cowart, C.; Block, J.; Crawl, D.; Graham, J.; Gupta, A.; Nguyen, M.; de Callafon, R.; Smarr, L.; Altintas, I.
2015-12-01
The NSF-funded WIFIRE project has developed an open-source, online geospatial workflow platform for unifying geoprocessing tools and models for for fire and other geospatially dependent modeling applications. It is a product of WIFIRE's objective to build an end-to-end cyberinfrastructure for real-time and data-driven simulation, prediction and visualization of wildfire behavior. geoKepler includes a set of reusable GIS components, or actors, for the Kepler Scientific Workflow System (https://kepler-project.org). Actors exist for reading and writing GIS data in formats such as Shapefile, GeoJSON, KML, and using OGC web services such as WFS. The actors also allow for calling geoprocessing tools in other packages such as GDAL and GRASS. Kepler integrates functions from multiple platforms and file formats into one framework, thus enabling optimal GIS interoperability, model coupling, and scalability. Products of the GIS actors can be fed directly to models such as FARSITE and WRF. Kepler's ability to schedule and scale processes using Hadoop and Spark also makes geoprocessing ultimately extensible and computationally scalable. The reusable workflows in geoKepler can be made to run automatically when alerted by real-time environmental conditions. Here, we show breakthroughs in the speed of creating complex data for hazard assessments with this platform. We also demonstrate geoKepler workflows that use Data Assimilation to ingest real-time weather data into wildfire simulations, and for data mining techniques to gain insight into environmental conditions affecting fire behavior. Existing machine learning tools and libraries such as R and MLlib are being leveraged for this purpose in Kepler, as well as Kepler's Distributed Data Parallel (DDP) capability to provide a framework for scalable processing. geoKepler workflows can be executed via an iPython notebook as a part of a Jupyter hub at UC San Diego for sharing and reporting of the scientific analysis and results from
Fast and Scalable Computation of the Forward and Inverse Discrete Periodic Radon Transform.
Carranza, Cesar; Llamocca, Daniel; Pattichis, Marios
2016-01-01
The discrete periodic radon transform (DPRT) has extensively been used in applications that involve image reconstructions from projections. Beyond classic applications, the DPRT can also be used to compute fast convolutions that avoids the use of floating-point arithmetic associated with the use of the fast Fourier transform. Unfortunately, the use of the DPRT has been limited by the need to compute a large number of additions and the need for a large number of memory accesses. This paper introduces a fast and scalable approach for computing the forward and inverse DPRT that is based on the use of: a parallel array of fixed-point adder trees; circular shift registers to remove the need for accessing external memory components when selecting the input data for the adder trees; an image block-based approach to DPRT computation that can fit the proposed architecture to available resources; and fast transpositions that are computed in one or a few clock cycles that do not depend on the size of the input image. As a result, for an N × N image (N prime), the proposed approach can compute up to N(2) additions per clock cycle. Compared with the previous approaches, the scalable approach provides the fastest known implementations for different amounts of computational resources. For example, for a 251×251 image, for approximately 25% fewer flip-flops than required for a systolic implementation, we have that the scalable DPRT is computed 36 times faster. For the fastest case, we introduce optimized just 2N + ⌈log(2) N⌉ + 1 and 2N + 3 ⌈log(2) N⌉ + B + 2 cycles, architectures that can compute the DPRT and its inverse in respectively, where B is the number of bits used to represent each input pixel. On the other hand, the scalable DPRT approach requires more 1-b additions than for the systolic implementation and provides a tradeoff between speed and additional 1-b additions. All of the proposed DPRT architectures were implemented in VHSIC Hardware Description Language
Expressing Parallelism with ROOT
Energy Technology Data Exchange (ETDEWEB)
Piparo, D. [CERN; Tejedor, E. [CERN; Guiraud, E. [CERN; Ganis, G. [CERN; Mato, P. [CERN; Moneta, L. [CERN; Valls Pla, X. [CERN; Canal, P. [Fermilab
2017-11-22
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Expressing Parallelism with ROOT
Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.
2017-10-01
The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.
Parallel Fast Legendre Transform
Alves de Inda, M.; Bisseling, R.H.; Maslen, D.K.
1998-01-01
We discuss a parallel implementation of a fast algorithm for the discrete polynomial Legendre transform We give an introduction to the DriscollHealy algorithm using polynomial arithmetic and present experimental results on the eciency and accuracy of our implementation The algorithms were
Practical parallel programming
Bauer, Barr E
2014-01-01
This is the book that will teach programmers to write faster, more efficient code for parallel processors. The reader is introduced to a vast array of procedures and paradigms on which actual coding may be based. Examples and real-life simulations using these devices are presented in C and FORTRAN.
Parallel universes beguile science
2007-01-01
A staple of mind-bending science fiction, the possibility of multiple universes has long intrigued hard-nosed physicists, mathematicians and cosmologists too. We may not be able -- as least not yet -- to prove they exist, many serious scientists say, but there are plenty of reasons to think that parallel dimensions are more than figments of eggheaded imagination.
Energy Technology Data Exchange (ETDEWEB)
2017-04-04
A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique. We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.
International Nuclear Information System (INIS)
Gardes, D.; Volkov, P.
1981-01-01
A 5x3cm 2 (timing only) and a 15x5cm 2 (timing and position) parallel plate avalanche counters (PPAC) are considered. The theory of operation and timing resolution is given. The measurement set-up and the curves of experimental results illustrate the possibilities of the two counters [fr
Parallel hierarchical global illumination
Energy Technology Data Exchange (ETDEWEB)
Snell, Quinn O. [Iowa State Univ., Ames, IA (United States)
1997-10-08
Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving the problem on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, the authors have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recent published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements for ray histories. The algorithm, called Photon, produces a scene which converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run on most parallel environments from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.
Novel Parallel Numerical Methods for Radiation and Neutron Transport
International Nuclear Information System (INIS)
Brown, P N
2001-01-01
In many of the multiphysics simulations performed at LLNL, transport calculations can take up 30 to 50% of the total run time. If Monte Carlo methods are used, the percentage can be as high as 80%. Thus, a significant core competence in the formulation, software implementation, and solution of the numerical problems arising in transport modeling is essential to Laboratory and DOE research. In this project, we worked on developing scalable solution methods for the equations that model the transport of photons and neutrons through materials. Our goal was to reduce the transport solve time in these simulations by means of more advanced numerical methods and their parallel implementations. These methods must be scalable, that is, the time to solution must remain constant as the problem size grows and additional computer resources are used. For iterative methods, scalability requires that (1) the number of iterations to reach convergence is independent of problem size, and (2) that the computational cost grows linearly with problem size. We focused on deterministic approaches to transport, building on our earlier work in which we performed a new, detailed analysis of some existing transport methods and developed new approaches. The Boltzmann equation (the underlying equation to be solved) and various solution methods have been developed over many years. Consequently, many laboratory codes are based on these methods, which are in some cases decades old. For the transport of x-rays through partially ionized plasmas in local thermodynamic equilibrium, the transport equation is coupled to nonlinear diffusion equations for the electron and ion temperatures via the highly nonlinear Planck function. We investigated the suitability of traditional-solution approaches to transport on terascale architectures and also designed new scalable algorithms; in some cases, we investigated hybrid approaches that combined both
Parallel Algorithms for Graph Optimization using Tree Decompositions
Energy Technology Data Exchange (ETDEWEB)
Sullivan, Blair D [ORNL; Weerapurage, Dinesh P [ORNL; Groer, Christopher S [ORNL
2012-06-01
Although many $\\cal{NP}$-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the necessary dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task-based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Parallel k-means++ for Multiple Shared-Memory Architectures
Energy Technology Data Exchange (ETDEWEB)
Mackey, Patrick S.; Lewis, Robert R.
2016-09-22
In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.
A hybrid method for the parallel computation of Green's functions
DEFF Research Database (Denmark)
Petersen, Dan Erik; Li, Song; Stokbro, Kurt
2009-01-01
of the large number of times this calculation needs to be performed, this is computationally very expensive even on supercomputers. The classical approach is based on recurrence formulas which cannot be efficiently parallelized. This practically prevents the solution of large problems with hundreds...... of thousands of atoms. We propose new recurrences for a general class of sparse matrices to calculate Green's and lesser Green's function matrices which extend formulas derived by Takahashi and others. We show that these recurrences may lead to a dramatically reduced computational cost because they only...... require computing a small number of entries of the inverse matrix. Then. we propose a parallelization strategy for block tridiagonal matrices which involves a combination of Schur complement calculations and cyclic reduction. It achieves good scalability even on problems of modest size....
Large-Scale Parallel Viscous Flow Computations using an Unstructured Multigrid Algorithm
Mavriplis, Dimitri J.
1999-01-01
The development and testing of a parallel unstructured agglomeration multigrid algorithm for steady-state aerodynamic flows is discussed. The agglomeration multigrid strategy uses a graph algorithm to construct the coarse multigrid levels from the given fine grid, similar to an algebraic multigrid approach, but operates directly on the non-linear system using the FAS (Full Approximation Scheme) approach. The scalability and convergence rate of the multigrid algorithm are examined on the SGI Origin 2000 and the Cray T3E. An argument is given which indicates that the asymptotic scalability of the multigrid algorithm should be similar to that of its underlying single grid smoothing scheme. For medium size problems involving several million grid points, near perfect scalability is obtained for the single grid algorithm, while only a slight drop-off in parallel efficiency is observed for the multigrid V- and W-cycles, using up to 128 processors on the SGI Origin 2000, and up to 512 processors on the Cray T3E. For a large problem using 25 million grid points, good scalability is observed for the multigrid algorithm using up to 1450 processors on a Cray T3E, even when the coarsest grid level contains fewer points than the total number of processors.
Quality Scalability Compression on Single-Loop Solution in HEVC
Directory of Open Access Journals (Sweden)
Mengmeng Zhang
2014-01-01
Full Text Available This paper proposes a quality scalable extension design for the upcoming high efficiency video coding (HEVC standard. In the proposed design, the single-loop decoder solution is extended into the proposed scalable scenario. A novel interlayer intra/interprediction is added to reduce the amount of bits representation by exploiting the correlation between coding layers. The experimental results indicate that the average Bjøntegaard delta rate decrease of 20.50% can be gained compared with the simulcast encoding. The proposed technique achieved 47.98% Bjøntegaard delta rate reduction compared with the scalable video coding extension of the H.264/AVC. Consequently, significant rate savings confirm that the proposed method achieves better performance.
Frame-Based and Subpicture-Based Parallelization Approaches of the HEVC Video Encoder
Directory of Open Access Journals (Sweden)
Héctor Migallón
2018-05-01
Full Text Available The most recent video coding standard, High Efficiency Video Coding (HEVC, is able to significantly improve the compression performance at the expense of a huge computational complexity increase with respect to its predecessor, H.264/AVC. Parallel versions of the HEVC encoder may help to reduce the overall encoding time in order to make it more suitable for practical applications. In this work, we study two parallelization strategies. One of them follows a coarse-grain approach, where parallelization is based on frames, and the other one follows a fine-grain approach, where parallelization is performed at subpicture level. Two different frame-based approaches have been developed. The first one only uses MPI and the second one is a hybrid MPI/OpenMP algorithm. An exhaustive experimental test was carried out to study the performance of both approaches in order to find out the best setup in terms of parallel efficiency and coding performance. Both frame-based and subpicture-based approaches are compared under the same hardware platform. Although subpicture-based schemes provide an excellent performance with high-resolution video sequences, scalability is limited by resolution, and the coding performance worsens by increasing the number of processes. Conversely, the proposed frame-based approaches provide the best results with respect to both parallel performance (increasing scalability and coding performance (not degrading the rate/distortion behavior.
Micro-Scalable Thermal Control Device
Moran, Matthew E. (Inventor)
2002-01-01
A microscalable thermal control module consists of a Stirling cycle cooler that can be manipulated to operate at a selected temperature within the heating and cooling range of the module. The microscalable thermal control module is particularly suited for controlling the temperature of devices that must be maintained at precise temperatures. It is particularly suited for controlling the temperature of devices that need to be alternately heated or cooled. The module contains upper and lower opposing diaphragms, with a regenerator region containing a plurality of regenerators interposed between the diaphragms. Gaps exist on each side of each diaphragm to permit it to oscillate freely. The gap on the interior side one diaphragm is in fluid connection with the gap on the interior side of the other diaphragm through regenerators. As the diaphragms oscillate working gas is forced through the regenerators. The surface area of each regenerator is sufficiently large to effectively transfer thermal energy to and from the working gas as it is passed through them. The phase and amplitude of the oscillations can be manipulated electronically to control the steady state temperature of the active thermal control surface, and to switch the operation of the module from cooling to heating, or vice versa. The ability of the microscalable thermal control module to heat and cool may be enhanced by operating a plurality of modules in series, in parallel, or in connection through a shared bottom layer.
Scalable DeNoise-and-Forward in Bidirectional Relay Networks
DEFF Research Database (Denmark)
Sørensen, Jesper Hemming; Krigslund, Rasmus; Popovski, Petar
2010-01-01
In this paper a scalable relaying scheme is proposed based on an existing concept called DeNoise-and-Forward, DNF. We call it Scalable DNF, S-DNF, and it targets the scenario with multiple communication flows through a single common relay. The idea of the scheme is to combine packets at the relay...... in order to save transmissions. To ensure decodability at the end-nodes, a priori information about the content of the combined packets must be available. This is gathered during the initial transmissions to the relay. The trade-off between decodability and number of necessary transmissions is analysed...
Weighted Local Active Pixel Pattern (WLAPP for Face Recognition in Parallel Computation Environment
Directory of Open Access Journals (Sweden)
Gundavarapu Mallikarjuna Rao
2013-10-01
Full Text Available Abstract - The availability of multi-core technology resulted totally new computational era. Researchers are keen to explore available potential in state of art-machines for breaking the bearer imposed by serial computation. Face Recognition is one of the challenging applications on so ever computational environment. The main difficulty of traditional Face Recognition algorithms is lack of the scalability. In this paper Weighted Local Active Pixel Pattern (WLAPP, a new scalable Face Recognition Algorithm suitable for parallel environment is proposed. Local Active Pixel Pattern (LAPP is found to be simple and computational inexpensive compare to Local Binary Patterns (LBP. WLAPP is developed based on concept of LAPP. The experimentation is performed on FG-Net Aging Database with deliberately introduced 20% distortion and the results are encouraging. Keywords — Active pixels, Face Recognition, Local Binary Pattern (LBP, Local Active Pixel Pattern (LAPP, Pattern computing, parallel workers, template, weight computation.
PVeStA: A Parallel Statistical Model Checking and Quantitative Analysis Tool
AlTurki, Musab
2011-01-01
Statistical model checking is an attractive formal analysis method for probabilistic systems such as, for example, cyber-physical systems which are often probabilistic in nature. This paper is about drastically increasing the scalability of statistical model checking, and making such scalability of analysis available to tools like Maude, where probabilistic systems can be specified at a high level as probabilistic rewrite theories. It presents PVeStA, an extension and parallelization of the VeStA statistical model checking tool [10]. PVeStA supports statistical model checking of probabilistic real-time systems specified as either: (i) discrete or continuous Markov Chains; or (ii) probabilistic rewrite theories in Maude. Furthermore, the properties that it can model check can be expressed in either: (i) PCTL/CSL, or (ii) the QuaTEx quantitative temporal logic. As our experiments show, the performance gains obtained from parallelization can be very high. © 2011 Springer-Verlag.
Proxy-equation paradigm: A strategy for massively parallel asynchronous computations
Mittal, Ankita; Girimaji, Sharath
2017-09-01
Massively parallel simulations of transport equation systems call for a paradigm change in algorithm development to achieve efficient scalability. Traditional approaches require time synchronization of processing elements (PEs), which severely restricts scalability. Relaxing synchronization requirement introduces error and slows down convergence. In this paper, we propose and develop a novel "proxy equation" concept for a general transport equation that (i) tolerates asynchrony with minimal added error, (ii) preserves convergence order and thus, (iii) expected to scale efficiently on massively parallel machines. The central idea is to modify a priori the transport equation at the PE boundaries to offset asynchrony errors. Proof-of-concept computations are performed using a one-dimensional advection (convection) diffusion equation. The results demonstrate the promise and advantages of the present strategy.
Gunnels, John; Lee, Jon; Margulies, Susan
2010-01-01
We provide a first demonstration of the idea that matrix-based algorithms for nonlinear combinatorial optimization problems can be efficiently implemented. Such algorithms were mainly conceived by theoretical computer scientists for proving efficiency. We are able to demonstrate the practicality of our approach by developing an implementation on a massively parallel architecture, and exploiting scalable and efficient parallel implementations of algorithms for ultra high-precision linear algebra. Additionally, we have delineated and implemented the necessary algorithmic and coding changes required in order to address problems several orders of magnitude larger, dealing with the limits of scalability from memory footprint, computational efficiency, reliability, and interconnect perspectives. © Springer and Mathematical Programming Society 2010.
Gunnels, John
2010-06-01
We provide a first demonstration of the idea that matrix-based algorithms for nonlinear combinatorial optimization problems can be efficiently implemented. Such algorithms were mainly conceived by theoretical computer scientists for proving efficiency. We are able to demonstrate the practicality of our approach by developing an implementation on a massively parallel architecture, and exploiting scalable and efficient parallel implementations of algorithms for ultra high-precision linear algebra. Additionally, we have delineated and implemented the necessary algorithmic and coding changes required in order to address problems several orders of magnitude larger, dealing with the limits of scalability from memory footprint, computational efficiency, reliability, and interconnect perspectives. © Springer and Mathematical Programming Society 2010.
Wald, Ingo; Ize, Santiago
2015-07-28
Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
Ultrascalable petaflop parallel supercomputer
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Chiu, George [Cross River, NY; Cipolla, Thomas M [Katonah, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Hall, Shawn [Pleasantville, NY; Haring, Rudolf A [Cortlandt Manor, NY; Heidelberger, Philip [Cortlandt Manor, NY; Kopcsay, Gerard V [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Salapura, Valentina [Chappaqua, NY; Sugavanam, Krishnan [Mahopac, NY; Takken, Todd [Brewster, NY
2010-07-20
A massively parallel supercomputer of petaOPS-scale includes node architectures based upon System-On-a-Chip technology, where each processing node comprises a single Application Specific Integrated Circuit (ASIC) having up to four processing elements. The ASIC nodes are interconnected by multiple independent networks that optimally maximize the throughput of packet communications between nodes with minimal latency. The multiple networks may include three high-speed networks for parallel algorithm message passing including a Torus, collective network, and a Global Asynchronous network that provides global barrier and notification functions. These multiple independent networks may be collaboratively or independently utilized according to the needs or phases of an algorithm for optimizing algorithm processing performance. The use of a DMA engine is provided to facilitate message passing among the nodes without the expenditure of processing resources at the node.
DEFF Research Database (Denmark)
Gregersen, Frans; Josephson, Olle; Kristoffersen, Gjert
of departure that English may be used in parallel with the various local, in this case Nordic, languages. As such, the book integrates the challenge of internationalization faced by any university with the wish to improve quality in research, education and administration based on the local language......Abstract [en] More parallel, please is the result of the work of an Inter-Nordic group of experts on language policy financed by the Nordic Council of Ministers 2014-17. The book presents all that is needed to plan, practice and revise a university language policy which takes as its point......(s). There are three layers in the text: First, you may read the extremely brief version of the in total 11 recommendations for best practice. Second, you may acquaint yourself with the extended version of the recommendations and finally, you may study the reasoning behind each of them. At the end of the text, we give...
PARALLEL MOVING MECHANICAL SYSTEMS
Directory of Open Access Journals (Sweden)
Florian Ion Tiberius Petrescu
2014-09-01
Full Text Available Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 Moving mechanical systems parallel structures are solid, fast, and accurate. Between parallel systems it is to be noticed Stewart platforms, as the oldest systems, fast, solid and precise. The work outlines a few main elements of Stewart platforms. Begin with the geometry platform, kinematic elements of it, and presented then and a few items of dynamics. Dynamic primary element on it means the determination mechanism kinetic energy of the entire Stewart platforms. It is then in a record tail cinematic mobile by a method dot matrix of rotation. If a structural mottoelement consists of two moving elements which translates relative, drive train and especially dynamic it is more convenient to represent the mottoelement as a single moving components. We have thus seven moving parts (the six motoelements or feet to which is added mobile platform 7 and one fixed.
Xyce parallel electronic simulator.
Energy Technology Data Exchange (ETDEWEB)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.; Rankin, Eric Lamont; Schiek, Richard Louis; Thornquist, Heidi K.; Fixel, Deborah A.; Coffey, Todd S; Pawlowski, Roger P; Santarelli, Keith R.
2010-05-01
This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide. The focus of this document is (to the extent possible) exhaustively list device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide.
Betchov, R
2012-01-01
Stability of Parallel Flows provides information pertinent to hydrodynamical stability. This book explores the stability problems that occur in various fields, including electronics, mechanics, oceanography, administration, economics, as well as naval and aeronautical engineering. Organized into two parts encompassing 10 chapters, this book starts with an overview of the general equations of a two-dimensional incompressible flow. This text then explores the stability of a laminar boundary layer and presents the equation of the inviscid approximation. Other chapters present the general equation
Algorithmically specialized parallel computers
Snyder, Lawrence; Gannon, Dennis B
1985-01-01
Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Distributed and cloud computing from parallel processing to the Internet of Things
Hwang, Kai; Fox, Geoffrey C
2012-01-01
Distributed and Cloud Computing, named a 2012 Outstanding Academic Title by the American Library Association's Choice publication, explains how to create high-performance, scalable, reliable systems, exposing the design principles, architecture, and innovative applications of parallel, distributed, and cloud computing systems. Starting with an overview of modern distributed models, the book provides comprehensive coverage of distributed and cloud computing, including: Facilitating management, debugging, migration, and disaster recovery through virtualization Clustered systems for resear
Scalable storage for a DBMS using transparent distribution
J.S. Karlsson; M.L. Kersten (Martin)
1997-01-01
textabstractScalable Distributed Data Structures (SDDSs) provide a self-managing and self-organizing data storage of potentially unbounded size. This stands in contrast to common distribution schemas deployed in conventional distributed DBMS. SDDSs, however, have mostly been used in synthetic
Scalable force directed graph layout algorithms using fast multipole methods
Yunis, Enas Abdulrahman; Yokota, Rio; Ahmadia, Aron
2012-01-01
We present an extension to ExaFMM, a Fast Multipole Method library, as a generalized approach for fast and scalable execution of the Force-Directed Graph Layout algorithm. The Force-Directed Graph Layout algorithm is a physics-based approach
Cascaded column generation for scalable predictive demand side management
Toersche, Hermen; Molderink, Albert; Hurink, Johann L.; Smit, Gerardus Johannes Maria
2014-01-01
We propose a nested Dantzig-Wolfe decomposition, combined with dynamic programming, for the distributed scheduling of a large heterogeneous fleet of residential appliances with nonlinear behavior. A cascaded column generation approach gives a scalable optimization strategy, provided that the problem
Scalable power selection method for wireless mesh networks
CSIR Research Space (South Africa)
Olwal, TO
2009-01-01
Full Text Available This paper addresses the problem of a scalable dynamic power control (SDPC) for wireless mesh networks (WMNs) based on IEEE 802.11 standards. An SDPC model that accounts for architectural complexities witnessed in multiple radios and hops...
Efficient Enhancement for Spatial Scalable Video Coding Transmission
Directory of Open Access Journals (Sweden)
Mayada Khairy
2017-01-01
Full Text Available Scalable Video Coding (SVC is an international standard technique for video compression. It is an extension of H.264 Advanced Video Coding (AVC. In the encoding of video streams by SVC, it is suitable to employ the macroblock (MB mode because it affords superior coding efficiency. However, the exhaustive mode decision technique that is usually used for SVC increases the computational complexity, resulting in a longer encoding time (ET. Many other algorithms were proposed to solve this problem with imperfection of increasing transmission time (TT across the network. To minimize the ET and TT, this paper introduces four efficient algorithms based on spatial scalability. The algorithms utilize the mode-distribution correlation between the base layer (BL and enhancement layers (ELs and interpolation between the EL frames. The proposed algorithms are of two categories. Those of the first category are based on interlayer residual SVC spatial scalability. They employ two methods, namely, interlayer interpolation (ILIP and the interlayer base mode (ILBM method, and enable ET and TT savings of up to 69.3% and 83.6%, respectively. The algorithms of the second category are based on full-search SVC spatial scalability. They utilize two methods, namely, full interpolation (FIP and the full-base mode (FBM method, and enable ET and TT savings of up to 55.3% and 76.6%, respectively.
Scalable Robust Principal Component Analysis Using Grassmann Averages
DEFF Research Database (Denmark)
Hauberg, Søren; Feragen, Aasa; Enficiaud, Raffi
2016-01-01
In large datasets, manual data verification is impossible, and we must expect the number of outliers to increase with data size. While principal component analysis (PCA) can reduce data size, and scalable solutions exist, it is well-known that outliers can arbitrarily corrupt the results. Unfortu...
A Scalable Smart Meter Data Generator Using Spark
DEFF Research Database (Denmark)
Iftikhar, Nadeem; Liu, Xiufeng; Danalachi, Sergiu
2017-01-01
Today, smart meters are being used worldwide. As a matter of fact smart meters produce large volumes of data. Thus, it is important for smart meter data management and analytics systems to process petabytes of data. Benchmarking and testing of these systems require scalable data, however, it can ...
Scalability and efficiency of genetic algorithms for geometrical applications
Dijk, van S.F.; Thierens, D.; Berg, de M.; Schoenauer, M.
2000-01-01
We study the scalability and efficiency of a GA that we developed earlier to solve the practical cartographic problem of labeling a map with point features. We argue that the special characteristics of our GA make that it fits in well with theoretical models predicting the optimal population size
Scalable electro-photonic integration concept based on polymer waveguides
Bosman, E.; Steenberge, G. van; Boersma, A.; Wiegersma, S.; Harmsma, P.J.; Karppinen, M.; Korhonen, T.; Offrein, B.J.; Dangel, R.; Daly, A.; Ortsiefer, M.; Justice, J.; Corbett, B.; Dorrestein, S.; Duis, J.
2016-01-01
A novel method for fabricating a single mode optical interconnection platform is presented. The method comprises the miniaturized assembly of optoelectronic single dies, the scalable fabrication of polymer single mode waveguides and the coupling to glass fiber arrays providing the I/O's. The low
A Massively Scalable Architecture for Instant Messaging & Presence
Schippers, Jorrit; Remke, Anne Katharina Ingrid; Punt, Henk; Wegdam, M.; Haverkort, Boudewijn R.H.M.; Thomas, N.; Bradley, J.; Knottenbelt, W.; Dingle, N.; Harder, U.
2010-01-01
This paper analyzes the scalability of Instant Messaging & Presence (IM&P) architectures. We take a queueing-based modelling and analysis approach to ��?nd the bottlenecks of the current IM&P architecture at the Dutch social network Hyves, as well as of alternative architectures. We use the
Adolescent sexuality education: An appraisal of some scalable ...
African Journals Online (AJOL)
Adolescent sexuality education: An appraisal of some scalable interventions for the Nigerian context. VC Pam. Abstract. Most issues around sexual intercourse are highly sensitive topics in Nigeria. Despite the disturbingly high adolescent HIV prevalence and teenage pregnancy rate in Nigeria, sexuality education is ...
Scalable multifunction RF system concepts for joint operations
Otten, M.P.G.; Wit, J.J.M. de; Smits, F.M.A.; Rossum, W.L. van; Huizing, A.
2010-01-01
RF systems based on modular architectures have the potential of better re-use of technology, decreasing development time, and decreasing life cycle cost. Moreover, modular architectures provide scalability, allowing low cost upgrades and adaptability to different platforms. To achieve maximum
Estimates of the Sampling Distribution of Scalability Coefficient H
Van Onna, Marieke J. H.
2004-01-01
Coefficient "H" is used as an index of scalability in nonparametric item response theory (NIRT). It indicates the degree to which a set of items rank orders examinees. Theoretical sampling distributions, however, have only been derived asymptotically and only under restrictive conditions. Bootstrap methods offer an alternative possibility to…
PyClaw: Accessible, Extensible, Scalable Tools for Wave Propagation Problems
Ketcheson, David I.
2012-08-15
Development of scientific software involves tradeoffs between ease of use, generality, and performance. We describe the design of a general hyperbolic PDE solver that can be operated with the convenience of MATLAB yet achieves efficiency near that of hand-coded Fortran and scales to the largest supercomputers. This is achieved by using Python for most of the code while employing automatically wrapped Fortran kernels for computationally intensive routines, and using Python bindings to interface with a parallel computing library and other numerical packages. The software described here is PyClaw, a Python-based structured grid solver for general systems of hyperbolic PDEs [K. T. Mandli et al., PyClaw Software, Version 1.0, http://numerics.kaust.edu.sa/pyclaw/ (2011)]. PyClaw provides a powerful and intuitive interface to the algorithms of the existing Fortran codes Clawpack and SharpClaw, simplifying code development and use while providing massive parallelism and scalable solvers via the PETSc library. The package is further augmented by use of PyWENO for generation of efficient high-order weighted essentially nonoscillatory reconstruction code. The simplicity, capability, and performance of this approach are demonstrated through application to example problems in shallow water flow, compressible flow, and elasticity.
Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures
Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.
2016-12-01
The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
Ovcharenko, Aleksandr
2012-03-01
This paper introduces a general-purpose communication package built on top of MPI which is aimed at improving inter-processor communications independently of the supercomputer architecture being considered. The package is developed to support parallel applications that rely on computation characterized by large number of messages of various sizes, often small, that are focused within processor neighborhoods. In some cases, such as solvers having static mesh partitions, the number and size of messages are known a priori. However, in other cases such as mesh adaptation, the messages evolve and vary in number and size and include the dynamic movement of partition objects. The current package provides a utility for dynamic applications based on two key attributes that are: (i) explicit consideration of the neighborhood communication pattern to avoid many-to-many calls and also to reduce the number of collective calls to a minimum, and (ii) use of non-blocking MPI functions along with message packing to manage message flow control and reduce the number and time of communication calls. The test application demonstrated is parallel unstructured mesh adaptation. Results on IBM Blue Gene/P and Cray XE6 computers show that the use of neighborhood-based communication control leads to scalable results when executing generally imbalanced mesh adaptation runs. © 2011 Elsevier B.V. All rights reserved.
PyClaw: Accessible, Extensible, Scalable Tools for Wave Propagation Problems
Ketcheson, David I.; Mandli, Kyle; Ahmadia, Aron; Alghamdi, Amal; de Luna, Manuel Quezada; Parsani, Matteo; Knepley, Matthew G.; Emmett, Matthew
2012-01-01
Development of scientific software involves tradeoffs between ease of use, generality, and performance. We describe the design of a general hyperbolic PDE solver that can be operated with the convenience of MATLAB yet achieves efficiency near that of hand-coded Fortran and scales to the largest supercomputers. This is achieved by using Python for most of the code while employing automatically wrapped Fortran kernels for computationally intensive routines, and using Python bindings to interface with a parallel computing library and other numerical packages. The software described here is PyClaw, a Python-based structured grid solver for general systems of hyperbolic PDEs [K. T. Mandli et al., PyClaw Software, Version 1.0, http://numerics.kaust.edu.sa/pyclaw/ (2011)]. PyClaw provides a powerful and intuitive interface to the algorithms of the existing Fortran codes Clawpack and SharpClaw, simplifying code development and use while providing massive parallelism and scalable solvers via the PETSc library. The package is further augmented by use of PyWENO for generation of efficient high-order weighted essentially nonoscillatory reconstruction code. The simplicity, capability, and performance of this approach are demonstrated through application to example problems in shallow water flow, compressible flow, and elasticity.
A compact linear accelerator based on a scalable microelectromechanical-system RF-structure
Persaud, A.; Ji, Q.; Feinberg, E.; Seidl, P. A.; Waldron, W. L.; Schenkel, T.; Lal, A.; Vinayakumar, K. B.; Ardanuc, S.; Hammer, D. A.
2017-06-01
A new approach for a compact radio-frequency (RF) accelerator structure is presented. The new accelerator architecture is based on the Multiple Electrostatic Quadrupole Array Linear Accelerator (MEQALAC) structure that was first developed in the 1980s. The MEQALAC utilized RF resonators producing the accelerating fields and providing for higher beam currents through parallel beamlets focused using arrays of electrostatic quadrupoles (ESQs). While the early work obtained ESQs with lateral dimensions on the order of a few centimeters, using a printed circuit board (PCB), we reduce the characteristic dimension to the millimeter regime, while massively scaling up the potential number of parallel beamlets. Using Microelectromechanical systems scalable fabrication approaches, we are working on further reducing the characteristic dimension to the sub-millimeter regime. The technology is based on RF-acceleration components and ESQs implemented in the PCB or silicon wafers where each beamlet passes through beam apertures in the wafer. The complete accelerator is then assembled by stacking these wafers. This approach has the potential for fast and inexpensive batch fabrication of the components and flexibility in system design for application specific beam energies and currents. For prototyping the accelerator architecture, the components have been fabricated using the PCB. In this paper, we present proof of concept results of the principal components using the PCB: RF acceleration and ESQ focusing. Ongoing developments on implementing components in silicon and scaling of the accelerator technology to high currents and beam energies are discussed.
Scalable Nonlinear Solvers for Fully Implicit Coupled Nuclear Fuel Modeling. Final Report
International Nuclear Information System (INIS)
Cai, Xiao-Chuan; Yang, Chao; Pernice, Michael
2014-01-01
The focus of the project is on the development and customization of some highly scalable domain decomposition based preconditioning techniques for the numerical solution of nonlinear, coupled systems of partial differential equations (PDEs) arising from nuclear fuel simulations. These high-order PDEs represent multiple interacting physical fields (for example, heat conduction, oxygen transport, solid deformation), each is modeled by a certain type of Cahn-Hilliard and/or Allen-Cahn equations. Most existing approaches involve a careful splitting of the fields and the use of field-by-field iterations to obtain a solution of the coupled problem. Such approaches have many advantages such as ease of implementation since only single field solvers are needed, but also exhibit disadvantages. For example, certain nonlinear interactions between the fields may not be fully captured, and for unsteady problems, stable time integration schemes are difficult to design. In addition, when implemented on large scale parallel computers, the sequential nature of the field-by-field iterations substantially reduces the parallel efficiency. To overcome the disadvantages, fully coupled approaches have been investigated in order to obtain full physics simulations.
A compact linear accelerator based on a scalable microelectromechanical-system RF-structure.
Persaud, A; Ji, Q; Feinberg, E; Seidl, P A; Waldron, W L; Schenkel, T; Lal, A; Vinayakumar, K B; Ardanuc, S; Hammer, D A
2017-06-01
A new approach for a compact radio-frequency (RF) accelerator structure is presented. The new accelerator architecture is based on the Multiple Electrostatic Quadrupole Array Linear Accelerator (MEQALAC) structure that was first developed in the 1980s. The MEQALAC utilized RF resonators producing the accelerating fields and providing for higher beam currents through parallel beamlets focused using arrays of electrostatic quadrupoles (ESQs). While the early work obtained ESQs with lateral dimensions on the order of a few centimeters, using a printed circuit board (PCB), we reduce the characteristic dimension to the millimeter regime, while massively scaling up the potential number of parallel beamlets. Using Microelectromechanical systems scalable fabrication approaches, we are working on further reducing the characteristic dimension to the sub-millimeter regime. The technology is based on RF-acceleration components and ESQs implemented in the PCB or silicon wafers where each beamlet passes through beam apertures in the wafer. The complete accelerator is then assembled by stacking these wafers. This approach has the potential for fast and inexpensive batch fabrication of the components and flexibility in system design for application specific beam energies and currents. For prototyping the accelerator architecture, the components have been fabricated using the PCB. In this paper, we present proof of concept results of the principal components using the PCB: RF acceleration and ESQ focusing. Ongoing developments on implementing components in silicon and scaling of the accelerator technology to high currents and beam energies are discussed.
International Nuclear Information System (INIS)
De Ladurantaye, Vincent; Lavoie, Jean; Bergeron, Jocelyn; Parenteau, Maxime; Lu Huizhong; Pichevar, Ramin; Rouat, Jean
2012-01-01
A parallel implementation of a large spiking neural network is proposed and evaluated. The neural network implements the binding by synchrony process using the Oscillatory Dynamic Link Matcher (ODLM). Scalability, speed and performance are compared for 2 implementations: Message Passing Interface (MPI) and Compute Unified Device Architecture (CUDA) running on clusters of multicore supercomputers and NVIDIA graphical processing units respectively. A global spiking list that represents at each instant the state of the neural network is described. This list indexes each neuron that fires during the current simulation time so that the influence of their spikes are simultaneously processed on all computing units. Our implementation shows a good scalability for very large networks. A complex and large spiking neural network has been implemented in parallel with success, thus paving the road towards real-life applications based on networks of spiking neurons. MPI offers a better scalability than CUDA, while the CUDA implementation on a GeForce GTX 285 gives the best cost to performance ratio. When running the neural network on the GTX 285, the processing speed is comparable to the MPI implementation on RQCHP's Mammouth parallel with 64 notes (128 cores).
Sanan, P.; Tackley, P. J.; Gerya, T.; Kaus, B. J. P.; May, D.
2017-12-01
StagBL is an open-source parallel solver and discretization library for geodynamic simulation,encapsulating and optimizing operations essential to staggered-grid finite volume Stokes flow solvers.It provides a parallel staggered-grid abstraction with a high-level interface in C and Fortran.On top of this abstraction, tools are available to define boundary conditions and interact with particle systems.Tools and examples to efficiently solve Stokes systems defined on the grid are provided in small (direct solver), medium (simple preconditioners), and large (block factorization and multigrid) model regimes.By working directly with leading application codes (StagYY, I3ELVIS, and LaMEM) and providing an API and examples to integrate with others, StagBL aims to become a community tool supplying scalable, portable, reproducible performance toward novel science in regional- and planet-scale geodynamics and planetary science.By implementing kernels used by many research groups beneath a uniform abstraction layer, the library will enable optimization for modern hardware, thus reducing community barriers to large- or extreme-scale parallel simulation on modern architectures. In particular, the library will include CPU-, Manycore-, and GPU-optimized variants of matrix-free operators and multigrid components.The common layer provides a framework upon which to introduce innovative new tools.StagBL will leverage p4est to provide distributed adaptive meshes, and incorporate a multigrid convergence analysis tool.These options, in addition to a wealth of solver options provided by an interface to PETSc, will make the most modern solution techniques available from a common interface. StagBL in turn provides a PETSc interface, DMStag, to its central staggered grid abstraction.We present public version 0.5 of StagBL, including preliminary integration with application codes and demonstrations with its own demonstration application, StagBLDemo. Central to StagBL is the notion of an
International Nuclear Information System (INIS)
Aparicio, G.; Blanquer, I.; Hernandez, V.; Segrelles, D.
2007-01-01
The integration of High-performance computing tools is a key issue in biomedical research. Many computer-based applications have been migrated to High-Performance computers to deal with their computing and storage needs such as BLAST. However, the use of clusters and computing farm presents problems in scalability. The use of a higher layer of parallelism that splits the task into highly independent long jobs that can be executed in parallel can improve the performance maintaining the efficiency. Grid technologies combined with parallel computing resources are an important enabling technology. This work presents a software architecture for executing BLAST in a International Grid Infrastructure that guarantees security, scalability and fault tolerance. The software architecture is modular an adaptable to many other high-throughput applications, both inside the field of bio computing and outside. (Author)
Parallel multiscale simulations of a brain aneurysm
Energy Technology Data Exchange (ETDEWEB)
Grinberg, Leopold [Division of Applied Mathematics, Brown University, Providence, RI 02912 (United States); Fedosov, Dmitry A. [Institute of Complex Systems and Institute for Advanced Simulation, Forschungszentrum Jülich, Jülich 52425 (Germany); Karniadakis, George Em, E-mail: george_karniadakis@brown.edu [Division of Applied Mathematics, Brown University, Providence, RI 02912 (United States)
2013-07-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Parallel multiscale simulations of a brain aneurysm
International Nuclear Information System (INIS)
Grinberg, Leopold; Fedosov, Dmitry A.; Karniadakis, George Em
2013-01-01
Cardiovascular pathologies, such as a brain aneurysm, are affected by the global blood circulation as well as by the local microrheology. Hence, developing computational models for such cases requires the coupling of disparate spatial and temporal scales often governed by diverse mathematical descriptions, e.g., by partial differential equations (continuum) and ordinary differential equations for discrete particles (atomistic). However, interfacing atomistic-based with continuum-based domain discretizations is a challenging problem that requires both mathematical and computational advances. We present here a hybrid methodology that enabled us to perform the first multiscale simulations of platelet depositions on the wall of a brain aneurysm. The large scale flow features in the intracranial network are accurately resolved by using the high-order spectral element Navier–Stokes solver NεκTαr. The blood rheology inside the aneurysm is modeled using a coarse-grained stochastic molecular dynamics approach (the dissipative particle dynamics method) implemented in the parallel code LAMMPS. The continuum and atomistic domains overlap with interface conditions provided by effective forces computed adaptively to ensure continuity of states across the interface boundary. A two-way interaction is allowed with the time-evolving boundary of the (deposited) platelet clusters tracked by an immersed boundary method. The corresponding heterogeneous solvers (NεκTαr and LAMMPS) are linked together by a computational multilevel message passing interface that facilitates modularity and high parallel efficiency. Results of multiscale simulations of clot formation inside the aneurysm in a patient-specific arterial tree are presented. We also discuss the computational challenges involved and present scalability results of our coupled solver on up to 300 K computer processors. Validation of such coupled atomistic-continuum models is a main open issue that has to be addressed in
Scalable Hierarchical Algorithms for stochastic PDEs and UQ
Litvinenko, Alexander; Chá vez, Gustavo; Keyes,David; Ltaief, Hatem; Yokota, Rio
2015-01-01
number of degrees of freedom in the discretization. The storage is reduced to the log-linear as well. This hierarchical structure is a good starting point for parallel algorithms. Parallelization on shared and distributed memory systems was pioneered
Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Albutiu, Martina-Cezara; Kemper, Alfons; Neumann, Thomas
2012-01-01
Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel mult...
Extending the POSIX I/O interface: a parallel file system perspective.
Energy Technology Data Exchange (ETDEWEB)
Vilayannur, M.; Lang, S.; Ross, R.; Klundt, R.; Ward, L.; Mathematics and Computer Science; VMWare, Inc.; SNL
2008-12-11
The POSIX interface does not lend itself well to enabling good performance for high-end applications. Extensions are needed in the POSIX I/O interface so that high-concurrency HPC applications running on top of parallel file systems perform well. This paper presents the rationale, design, and evaluation of a reference implementation of a subset of the POSIX I/O interfaces on a widely used parallel file system (PVFS) on clusters. Experimental results on a set of micro-benchmarks confirm that the extensions to the POSIX interface greatly improve scalability and performance.
Parallel processing architecture for H.264 deblocking filter on multi-core platforms
Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao
2012-03-01
Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve lowlatency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programing model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions. This work describes a scalable parallel architecture for an H.264 compliant deblocking
Resistor Combinations for Parallel Circuits.
McTernan, James P.
1978-01-01
To help simplify both teaching and learning of parallel circuits, a high school electricity/electronics teacher presents and illustrates the use of tables of values for parallel resistive circuits in which total resistances are whole numbers. (MF)
SOFTWARE FOR DESIGNING PARALLEL APPLICATIONS
Directory of Open Access Journals (Sweden)
M. K. Bouza
2017-01-01
Full Text Available The object of research is the tools to support the development of parallel programs in C/C ++. The methods and software which automates the process of designing parallel applications are proposed.
Evaluation of 3D printed anatomically scalable transfemoral prosthetic knee.
Ramakrishnan, Tyagi; Schlafly, Millicent; Reed, Kyle B
2017-07-01
This case study compares a transfemoral amputee's gait while using the existing Ossur Total Knee 2000 and our novel 3D printed anatomically scalable transfemoral prosthetic knee. The anatomically scalable transfemoral prosthetic knee is 3D printed out of a carbon-fiber and nylon composite that has a gear-mesh coupling with a hard-stop weight-actuated locking mechanism aided by a cross-linked four-bar spring mechanism. This design can be scaled using anatomical dimensions of a human femur and tibia to have a unique fit for each user. The transfemoral amputee who was tested is high functioning and walked on the Computer Assisted Rehabilitation Environment (CAREN) at a self-selected pace. The motion capture and force data that was collected showed that there were distinct differences in the gait dynamics. The data was used to perform the Combined Gait Asymmetry Metric (CGAM), where the scores revealed that the overall asymmetry of the gait on the Ossur Total Knee was more asymmetric than the anatomically scalable transfemoral prosthetic knee. The anatomically scalable transfemoral prosthetic knee had higher peak knee flexion that caused a large step time asymmetry. This made walking on the anatomically scalable transfemoral prosthetic knee more strenuous due to the compensatory movements in adapting to the different dynamics. This can be overcome by tuning the cross-linked spring mechanism to emulate the dynamics of the subject better. The subject stated that the knee would be good for daily use and has the potential to be adapted as a running knee.
Blind Cooperative Routing for Scalable and Energy-Efficient Internet of Things
Bader, Ahmed; Alouini, Mohamed-Slim
2016-01-01
Multihop networking is promoted in this paper for energy-efficient and highly-scalable Internet of Things (IoT). Recognizing concerns related to the scalability of classical multihop routing and medium access techniques, the use of blind cooperation
Parallel External Memory Graph Algorithms
DEFF Research Database (Denmark)
Arge, Lars Allan; Goodrich, Michael T.; Sitchinava, Nodari
2010-01-01
In this paper, we study parallel I/O efficient graph algorithms in the Parallel External Memory (PEM) model, one o f the private-cache chip multiprocessor (CMP) models. We study the fundamental problem of list ranking which leads to efficient solutions to problems on trees, such as computing lowest...... an optimal speedup of Â¿(P) in parallel I/O complexity and parallel computation time, compared to the single-processor external memory counterparts....
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A L
2012-06-13
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
Andrade, Xavier; Alberdi-Rodriguez, Joseba; Strubbe, David A.; Oliveira, Micael J. T.; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Louie, Steven G.; Aspuru-Guzik, Alán; Rubio, Angel; Marques, Miguel A. L.
2012-06-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures.
International Nuclear Information System (INIS)
Andrade, Xavier; Aspuru-Guzik, Alán; Alberdi-Rodriguez, Joseba; Rubio, Angel; Strubbe, David A; Louie, Steven G; Oliveira, Micael J T; Nogueira, Fernando; Castro, Alberto; Muguerza, Javier; Arruabarrena, Agustin; Marques, Miguel A L
2012-01-01
Octopus is a general-purpose density-functional theory (DFT) code, with a particular emphasis on the time-dependent version of DFT (TDDFT). In this paper we present the ongoing efforts to achieve the parallelization of octopus. We focus on the real-time variant of TDDFT, where the time-dependent Kohn-Sham equations are directly propagated in time. This approach has great potential for execution in massively parallel systems such as modern supercomputers with thousands of processors and graphics processing units (GPUs). For harvesting the potential of conventional supercomputers, the main strategy is a multi-level parallelization scheme that combines the inherent scalability of real-time TDDFT with a real-space grid domain-partitioning approach. A scalable Poisson solver is critical for the efficiency of this scheme. For GPUs, we show how using blocks of Kohn-Sham states provides the required level of data parallelism and that this strategy is also applicable for code optimization on standard processors. Our results show that real-time TDDFT, as implemented in octopus, can be the method of choice for studying the excited states of large molecular systems in modern parallel architectures. (topical review)
Parallel inter channel interaction mechanisms
International Nuclear Information System (INIS)
Jovic, V.; Afgan, N.; Jovic, L.
1995-01-01
Parallel channels interactions are examined. For experimental researches of nonstationary regimes flow in three parallel vertical channels results of phenomenon analysis and mechanisms of parallel channel interaction for adiabatic condition of one-phase fluid and two-phase mixture flow are shown. (author)
International Nuclear Information System (INIS)
Soltz, R; Vranas, P; Blumrich, M; Chen, D; Gara, A; Giampap, M; Heidelberger, P; Salapura, V; Sexton, J; Bhanot, G
2007-01-01
The theory of the strong nuclear force, Quantum Chromodynamics (QCD), can be numerically simulated from first principles on massively-parallel supercomputers using the method of Lattice Gauge Theory. We describe the special programming requirements of lattice QCD (LQCD) as well as the optimal supercomputer hardware architectures that it suggests. We demonstrate these methods on the BlueGene massively-parallel supercomputer and argue that LQCD and the BlueGene architecture are a natural match. This can be traced to the simple fact that LQCD is a regular lattice discretization of space into lattice sites while the BlueGene supercomputer is a discretization of space into compute nodes, and that both are constrained by requirements of locality. This simple relation is both technologically important and theoretically intriguing. The main result of this paper is the speedup of LQCD using up to 131,072 CPUs on the largest BlueGene/L supercomputer. The speedup is perfect with sustained performance of about 20% of peak. This corresponds to a maximum of 70.5 sustained TFlop/s. At these speeds LQCD and BlueGene are poised to produce the next generation of strong interaction physics theoretical results
A Parallel Butterfly Algorithm
Poulson, Jack; Demanet, Laurent; Maxwell, Nicholas; Ying, Lexing
2014-01-01
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(Nd) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r2Nd logN). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms, and an analogue of a three-dimensional generalized Radon transform were, respectively, observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.
A Parallel Butterfly Algorithm
Poulson, Jack
2014-02-04
The butterfly algorithm is a fast algorithm which approximately evaluates a discrete analogue of the integral transform (Equation Presented.) at large numbers of target points when the kernel, K(x, y), is approximately low-rank when restricted to subdomains satisfying a certain simple geometric condition. In d dimensions with O(Nd) quasi-uniformly distributed source and target points, when each appropriate submatrix of K is approximately rank-r, the running time of the algorithm is at most O(r2Nd logN). A parallelization of the butterfly algorithm is introduced which, assuming a message latency of α and per-process inverse bandwidth of β, executes in at most (Equation Presented.) time using p processes. This parallel algorithm was then instantiated in the form of the open-source DistButterfly library for the special case where K(x, y) = exp(iΦ(x, y)), where Φ(x, y) is a black-box, sufficiently smooth, real-valued phase function. Experiments on Blue Gene/Q demonstrate impressive strong-scaling results for important classes of phase functions. Using quasi-uniform sources, hyperbolic Radon transforms, and an analogue of a three-dimensional generalized Radon transform were, respectively, observed to strong-scale from 1-node/16-cores up to 1024-nodes/16,384-cores with greater than 90% and 82% efficiency, respectively. © 2014 Society for Industrial and Applied Mathematics.
Fast parallel event reconstruction
CERN. Geneva
2010-01-01
On-line processing of large data volumes produced in modern HEP experiments requires using maximum capabilities of modern and future many-core CPU and GPU architectures.One of such powerful feature is a SIMD instruction set, which allows packing several data items in one register and to operate on all of them, thus achievingmore operations per clock cycle. Motivated by the idea of using the SIMD unit ofmodern processors, the KF based track fit has been adapted for parallelism, including memory optimization, numerical analysis, vectorization with inline operator overloading, and optimization using SDKs. The speed of the algorithm has been increased in 120000 times with 0.1 ms/track, running in parallel on 16 SPEs of a Cell Blade computer. Running on a Nehalem CPU with 8 cores it shows the processing speed of 52 ns/track using the Intel Threading Building Blocks. The same KF algorithm running on an Nvidia GTX 280 in the CUDA frameworkprovi...
High-speed detection of emergent market clustering via an unsupervised parallel genetic algorithm
Directory of Open Access Journals (Sweden)
Dieter Hendricks
2016-02-01
Full Text Available We implement a master-slave parallel genetic algorithm with a bespoke log-likelihood fitness function to identify emergent clusters within price evolutions. We use graphics processing units (GPUs to implement a parallel genetic algorithm and visualise the results using disjoint minimal spanning trees. We demonstrate that our GPU parallel genetic algorithm, implemented on a commercially available general purpose GPU, is able to recover stock clusters in sub-second speed, based on a subset of stocks in the South African market. This approach represents a pragmatic choice for low-cost, scalable parallel computing and is significantly faster than a prototype serial implementation in an optimised C-based fourth-generation programming language, although the results are not directly comparable because of compiler differences. Combined with fast online intraday correlation matrix estimation from high frequency data for cluster identification, the proposed implementation offers cost-effective, near-real-time risk assessment for financial practitioners.
Temporal scalability comparison of the H.264/SVC and distributed video codec
DEFF Research Database (Denmark)
Huang, Xin; Ukhanova, Ann; Belyaev, Evgeny
2009-01-01
The problem of the multimedia scalable video streaming is a current topic of interest. There exist many methods for scalable video coding. This paper is focused on the scalable extension of H.264/AVC (H.264/SVC) and distributed video coding (DVC). The paper presents an efficiency comparison of SV...
BIGSdb: Scalable analysis of bacterial genome variation at the population level
Directory of Open Access Journals (Sweden)
Maiden Martin CJ
2010-12-01
Full Text Available Abstract Background The opportunities for bacterial population genomics that are being realised by the application of parallel nucleotide sequencing require novel bioinformatics platforms. These must be capable of the storage, retrieval, and analysis of linked phenotypic and genotypic information in an accessible, scalable and computationally efficient manner. Results The Bacterial Isolate Genome Sequence Database (BIGSDB is a scalable, open source, web-accessible database system that meets these needs, enabling phenotype and sequence data, which can range from a single sequence read to whole genome data, to be efficiently linked for a limitless number of bacterial specimens. The system builds on the widely used mlstdbNet software, developed for the storage and distribution of multilocus sequence typing (MLST data, and incorporates the capacity to define and identify any number of loci and genetic variants at those loci within the stored nucleotide sequences. These loci can be further organised into 'schemes' for isolate characterisation or for evolutionary or functional analyses. Isolates and loci can be indexed by multiple names and any number of alternative schemes can be accommodated, enabling cross-referencing of different studies and approaches. LIMS functionality of the software enables linkage to and organisation of laboratory samples. The data are easily linked to external databases and fine-grained authentication of access permits multiple users to participate in community annotation by setting up or contributing to different schemes within the database. Some of the applications of BIGSDB are illustrated with the genera Neisseria and Streptococcus. The BIGSDB source code and documentation are available at http://pubmlst.org/software/database/bigsdb/. Conclusions Genomic data can be used to characterise bacterial isolates in many different ways but it can also be efficiently exploited for evolutionary or functional studies. BIGSDB
A Cross-Platform Infrastructure for Scalable Runtime Application Performance Analysis
Energy Technology Data Exchange (ETDEWEB)
Jack Dongarra; Shirley Moore; Bart Miller, Jeffrey Hollingsworth; Tracy Rafferty
2005-03-15
The purpose of this project was to build an extensible cross-platform infrastructure to facilitate the development of accurate and portable performance analysis tools for current and future high performance computing (HPC) architectures. Major accomplishments include tools and techniques for multidimensional performance analysis, as well as improved support for dynamic performance monitoring of multithreaded and multiprocess applications. Previous performance tool development has been limited by the burden of having to re-write a platform-dependent low-level substrate for each architecture/operating system pair in order to obtain the necessary performance data from the system. Manual interpretation of performance data is not scalable for large-scale long-running applications. The infrastructure developed by this project provides a foundation for building portable and scalable performance analysis tools, with the end goal being to provide application developers with the information they need to analyze, understand, and tune the performance of terascale applications on HPC architectures. The backend portion of the infrastructure provides runtime instrumentation capability and access to hardware performance counters, with thread-safety for shared memory environments and a communication substrate to support instrumentation of multiprocess and distributed programs. Front end interfaces provides tool developers with a well-defined, platform-independent set of calls for requesting performance data. End-user tools have been developed that demonstrate runtime data collection, on-line and off-line analysis of performance data, and multidimensional performance analysis. The infrastructure is based on two underlying performance instrumentation technologies. These technologies are the PAPI cross-platform library interface to hardware performance counters and the cross-platform Dyninst library interface for runtime modification of executable images. The Paradyn and KOJAK
International Nuclear Information System (INIS)
DeHart, Mark D.; Williams, Mark L.; Bowman, Stephen M.
2010-01-01
The SCALE computational architecture has remained basically the same since its inception 30 years ago, although constituent modules and capabilities have changed significantly. This SCALE concept was intended to provide a framework whereby independent codes can be linked to provide a more comprehensive capability than possible with the individual programs - allowing flexibility to address a wide variety of applications. However, the current system was designed originally for mainframe computers with a single CPU and with significantly less memory than today's personal computers. It has been recognized that the present SCALE computation system could be restructured to take advantage of modern hardware and software capabilities, while retaining many of the modular features of the present system. Preliminary work is being done to define specifications and capabilities for a more advanced computational architecture. This paper describes the state of current SCALE development activities and plans for future development. With the release of SCALE 6.1 in 2010, a new phase of evolutionary development will be available to SCALE users within the TRITON and NEWT modules. The SCALE (Standardized Computer Analyses for Licensing Evaluation) code system developed by Oak Ridge National Laboratory (ORNL) provides a comprehensive and integrated package of codes and nuclear data for a wide range of applications in criticality safety, reactor physics, shielding, isotopic depletion and decay, and sensitivity/uncertainty (S/U) analysis. Over the last three years, since the release of version 5.1 in 2006, several important new codes have been introduced within SCALE, and significant advances applied to existing codes. Many of these new features became available with the release of SCALE 6.0 in early 2009. However, beginning with SCALE 6.1, a first generation of parallel computing is being introduced. In addition to near-term improvements, a plan for longer term SCALE enhancement
Parallel Computing Characteristics of CUPID code under MPI and Hybrid environment
Energy Technology Data Exchange (ETDEWEB)
Lee, Jae Ryong; Yoon, Han Young [Korea Atomic Energy Research Institute, Daejeon (Korea, Republic of); Jeon, Byoung Jin; Choi, Hyoung Gwon [Seoul National Univ. of Science and Technology, Seoul (Korea, Republic of)
2014-05-15
In this paper, a characteristic of parallel algorithm is presented for solving an elliptic type equation of CUPID via domain decomposition method using the MPI and the parallel performance is estimated in terms of a scalability which shows the speedup ratio. In addition, the time-consuming pattern of major subroutines is studied. Two different grid systems are taken into account: 40,000 meshes for coarse system and 320,000 meshes for fine system. Since the matrix of the CUPID code differs according to whether the flow is single-phase or two-phase, the effect of matrix shape is evaluated. Finally, the effect of the preconditioner for matrix solver is also investigated. Finally, the hybrid (OpenMP+MPI) parallel algorithm is introduced and discussed in detail for solving pressure solver. Component-scale thermal-hydraulics code, CUPID has been developed for two-phase flow analysis, which adopts a three-dimensional, transient, three-field model, and parallelized to fulfill a recent demand for long-transient and highly resolved multi-phase flow behavior. In this study, the parallel performance of the CUPID code was investigated in terms of scalability. The CUPID code was parallelized with domain decomposition method. The MPI library was adopted to communicate the information at the neighboring domain. For managing the sparse matrix effectively, the CSR storage format is used. To take into account the characteristics of the pressure matrix which turns to be asymmetric for two-phase flow, both single-phase and two-phase calculations were run. In addition, the effect of the matrix size and preconditioning was also investigated. The fine mesh calculation shows better scalability than the coarse mesh because the number of coarse mesh does not need to decompose the computational domain excessively. The fine mesh can be present good scalability when dividing geometry with considering the ratio between computation and communication time. For a given mesh, single-phase flow
Parallel Polarization State Generation.
She, Alan; Capasso, Federico
2016-05-17
The control of polarization, an essential property of light, is of wide scientific and technological interest. The general problem of generating arbitrary time-varying states of polarization (SOP) has always been mathematically formulated by a series of linear transformations, i.e. a product of matrices, imposing a serial architecture. Here we show a parallel architecture described by a sum of matrices. The theory is experimentally demonstrated by modulating spatially-separated polarization components of a laser using a digital micromirror device that are subsequently beam combined. This method greatly expands the parameter space for engineering devices that control polarization. Consequently, performance characteristics, such as speed, stability, and spectral range, are entirely dictated by the technologies of optical intensity modulation, including absorption, reflection, emission, and scattering. This opens up important prospects for polarization state generation (PSG) with unique performance characteristics with applications in spectroscopic ellipsometry, spectropolarimetry, communications, imaging, and security.
Parallel imaging microfluidic cytometer.
Ehrlich, Daniel J; McKenna, Brian K; Evans, James G; Belkina, Anna C; Denis, Gerald V; Sherr, David H; Cheung, Man Ching
2011-01-01
By adding an additional degree of freedom from multichannel flow, the parallel microfluidic cytometer (PMC) combines some of the best features of fluorescence-activated flow cytometry (FCM) and microscope-based high-content screening (HCS). The PMC (i) lends itself to fast processing of large numbers of samples, (ii) adds a 1D imaging capability for intracellular localization assays (HCS), (iii) has a high rare-cell sensitivity, and (iv) has an unusual capability for time-synchronized sampling. An inability to practically handle large sample numbers has restricted applications of conventional flow cytometers and microscopes in combinatorial cell assays, network biology, and drug discovery. The PMC promises to relieve a bottleneck in these previously constrained applications. The PMC may also be a powerful tool for finding rare primary cells in the clinic. The multichannel architecture of current PMC prototypes allows 384 unique samples for a cell-based screen to be read out in ∼6-10 min, about 30 times the speed of most current FCM systems. In 1D intracellular imaging, the PMC can obtain protein localization using HCS marker strategies at many times for the sample throughput of charge-coupled device (CCD)-based microscopes or CCD-based single-channel flow cytometers. The PMC also permits the signal integration time to be varied over a larger range than is practical in conventional flow cytometers. The signal-to-noise advantages are useful, for example, in counting rare positive cells in the most difficult early stages of genome-wide screening. We review the status of parallel microfluidic cytometry and discuss some of the directions the new technology may take. Copyright © 2011 Elsevier Inc. All rights reserved.
Scientific visualization uncertainty, multifield, biomedical, and scalable visualization
Chen, Min; Johnson, Christopher; Kaufman, Arie; Hagen, Hans
2014-01-01
Based on the seminar that took place in Dagstuhl, Germany in June 2011, this contributed volume studies the four important topics within the scientific visualization field: uncertainty visualization, multifield visualization, biomedical visualization and scalable visualization. • Uncertainty visualization deals with uncertain data from simulations or sampled data, uncertainty due to the mathematical processes operating on the data, and uncertainty in the visual representation, • Multifield visualization addresses the need to depict multiple data at individual locations and the combination of multiple datasets, • Biomedical is a vast field with select subtopics addressed from scanning methodologies to structural applications to biological applications, • Scalability in scientific visualization is critical as data grows and computational devices range from hand-held mobile devices to exascale computational platforms. Scientific Visualization will be useful to practitioners of scientific visualization, ...
Scalable quantum memory in the ultrastrong coupling regime.
Kyaw, T H; Felicetti, S; Romero, G; Solano, E; Kwek, L-C
2015-03-02
Circuit quantum electrodynamics, consisting of superconducting artificial atoms coupled to on-chip resonators, represents a prime candidate to implement the scalable quantum computing architecture because of the presence of good tunability and controllability. Furthermore, recent advances have pushed the technology towards the ultrastrong coupling regime of light-matter interaction, where the qubit-resonator coupling strength reaches a considerable fraction of the resonator frequency. Here, we propose a qubit-resonator system operating in that regime, as a quantum memory device and study the storage and retrieval of quantum information in and from the Z2 parity-protected quantum memory, within experimentally feasible schemes. We are also convinced that our proposal might pave a way to realize a scalable quantum random-access memory due to its fast storage and readout performances.
Fast & scalable pattern transfer via block copolymer nanolithography
DEFF Research Database (Denmark)
Li, Tao; Wang, Zhongli; Schulte, Lars
2015-01-01
A fully scalable and efficient pattern transfer process based on block copolymer (BCP) self-assembling directly on various substrates is demonstrated. PS-rich and PDMS-rich poly(styrene-b-dimethylsiloxane) (PS-b-PDMS) copolymers are used to give monolayer sphere morphology after spin-casting of s......A fully scalable and efficient pattern transfer process based on block copolymer (BCP) self-assembling directly on various substrates is demonstrated. PS-rich and PDMS-rich poly(styrene-b-dimethylsiloxane) (PS-b-PDMS) copolymers are used to give monolayer sphere morphology after spin...... on long range lateral order, including fabrication of substrates for catalysis, solar cells, sensors, ultrafiltration membranes and templating of semiconductors or metals....
Semantic Models for Scalable Search in the Internet of Things
Directory of Open Access Journals (Sweden)
Dennis Pfisterer
2013-03-01
Full Text Available The Internet of Things is anticipated to connect billions of embedded devices equipped with sensors to perceive their surroundings. Thereby, the state of the real world will be available online and in real-time and can be combined with other data and services in the Internet to realize novel applications such as Smart Cities, Smart Grids, or Smart Healthcare. This requires an open representation of sensor data and scalable search over data from diverse sources including sensors. In this paper we show how the Semantic Web technologies RDF (an open semantic data format and SPARQL (a query language for RDF-encoded data can be used to address those challenges. In particular, we describe how prediction models can be employed for scalable sensor search, how these prediction models can be encoded as RDF, and how the models can be queried by means of SPARQL.
Scalability of DL_POLY on High Performance Computing Platform
Directory of Open Access Journals (Sweden)
Mabule Samuel Mabakane
2017-12-01
Full Text Available This paper presents a case study on the scalability of several versions of the molecular dynamics code (DL_POLY performed on South Africa‘s Centre for High Performance Computing e1350 IBM Linux cluster, Sun system and Lengau supercomputers. Within this study different problem sizes were designed and the same chosen systems were employed in order to test the performance of DL_POLY using weak and strong scalability. It was found that the speed-up results for the small systems were better than large systems on both Ethernet and Infiniband network. However, simulations of large systems in DL_POLY performed well using Infiniband network on Lengau cluster as compared to e1350 and Sun supercomputer.
On the Scalability of Time-predictable Chip-Multiprocessing
DEFF Research Database (Denmark)
Puffitsch, Wolfgang; Schoeberl, Martin
2012-01-01
Real-time systems need a time-predictable execution platform to be able to determine the worst-case execution time statically. In order to be time-predictable, several advanced processor features, such as out-of-order execution and other forms of speculation, have to be avoided. However, just using...... simple processors is not an option for embedded systems with high demands on computing power. In order to provide high performance and predictability we argue to use multiprocessor systems with a time-predictable memory interface. In this paper we present the scalability of a Java chip......-multiprocessor system that is designed to be time-predictable. Adding time-predictable caches is mandatory to achieve scalability with a shared memory multi-processor system. As Java bytecode retains information about the nature of memory accesses, it is possible to implement a memory hierarchy that takes...
ATLAS Grid Data Processing: system evolution and scalability
Golubkov, D; The ATLAS collaboration; Klimentov, A; Minaenko, A; Nevski, P; Vaniachine, A; Walker, R
2012-01-01
The production system for Grid Data Processing handles petascale ATLAS data reprocessing and Monte Carlo activities. The production system empowered further data processing steps on the Grid performed by dozens of ATLAS physics groups with coordinated access to computing resources worldwide, including additional resources sponsored by regional facilities. The system provides knowledge management of configuration parameters for massive data processing tasks, reproducibility of results, scalable database access, orchestrated workflow and performance monitoring, dynamic workload sharing, automated fault tolerance and petascale data integrity control. The system evolves to accommodate a growing number of users and new requirements from our contacts in ATLAS main areas: Trigger, Physics, Data Preparation and Software & Computing. To assure scalability, the next generation production system architecture development is in progress. We report on scaling up the production system for a growing number of users provi...
NPTool: Towards Scalability and Reliability of Business Process Management
Braghetto, Kelly Rosa; Ferreira, João Eduardo; Pu, Calton
Currently one important challenge in business process management is provide at the same time scalability and reliability of business process executions. This difficulty becomes more accentuated when the execution control assumes complex countless business processes. This work presents NavigationPlanTool (NPTool), a tool to control the execution of business processes. NPTool is supported by Navigation Plan Definition Language (NPDL), a language for business processes specification that uses process algebra as formal foundation. NPTool implements the NPDL language as a SQL extension. The main contribution of this paper is a description of the NPTool showing how the process algebra features combined with a relational database model can be used to provide a scalable and reliable control in the execution of business processes. The next steps of NPTool include reuse of control-flow patterns and support to data flow management.
Proof of Stake Blockchain: Performance and Scalability for Groupware Communications
DEFF Research Database (Denmark)
Spasovski, Jason; Eklund, Peter
2017-01-01
A blockchain is a distributed transaction ledger, a disruptive technology that creates new possibilities for digital ecosystems. The blockchain ecosystem maintains an immutable transaction record to support many types of digital services. This paper compares the performance and scalability of a web......-based groupware communication application using both non-blockchain and blockchain technologies. Scalability is measured where message load is synthesized over two typical communication topologies. The first is 1 to n network -- a typical client-server or star-topology with a central vertex (server) receiving all...... messages from the remaining n - 1 vertices (clients). The second is a more naturally occurring scale-free network topology, where multiple communication hubs are distributed throughout the network. System performance is tested with both blockchain and non-blockchain solutions using multiple cloud computing...
Continuity-Aware Scheduling Algorithm for Scalable Video Streaming
Directory of Open Access Journals (Sweden)
Atinat Palawan
2016-05-01
Full Text Available The consumer demand for retrieving and delivering visual content through consumer electronic devices has increased rapidly in recent years. The quality of video in packet networks is susceptible to certain traffic characteristics: average bandwidth availability, loss, delay and delay variation (jitter. This paper presents a scheduling algorithm that modifies the stream of scalable video to combat jitter. The algorithm provides unequal look-ahead by safeguarding the base layer (without the need for overhead of the scalable video. The results of the experiments show that our scheduling algorithm reduces the number of frames with a violated deadline and significantly improves the continuity of the video stream without compromising the average Y Peek Signal-to-Noise Ratio (PSNR.
Scalable, full-colour and controllable chromotropic plasmonic printing
Xue, Jiancai; Zhou, Zhang-Kai; Wei, Zhiqiang; Su, Rongbin; Lai, Juan; Li, Juntao; Li, Chao; Zhang, Tengwei; Wang, Xue-Hua
2015-01-01
Plasmonic colour printing has drawn wide attention as a promising candidate for the next-generation colour-printing technology. However, an efficient approach to realize full colour and scalable fabrication is still lacking, which prevents plasmonic colour printing from practical applications. Here we present a scalable and full-colour plasmonic printing approach by combining conjugate twin-phase modulation with a plasmonic broadband absorber. More importantly, our approach also demonstrates controllable chromotropic capability, that is, the ability of reversible colour transformations. This chromotropic capability affords enormous potentials in building functionalized prints for anticounterfeiting, special label, and high-density data encryption storage. With such excellent performances in functional colour applications, this colour-printing approach could pave the way for plasmonic colour printing in real-world commercial utilization. PMID:26567803
About Parallel Programming: Paradigms, Parallel Execution and Collaborative Systems
Directory of Open Access Journals (Sweden)
Loredana MOCEAN
2009-01-01
Full Text Available In the last years, there were made efforts for delineation of a stabile and unitary frame, where the problems of logical parallel processing must find solutions at least at the level of imperative languages. The results obtained by now are not at the level of the made efforts. This paper wants to be a little contribution at these efforts. We propose an overview in parallel programming, parallel execution and collaborative systems.
A Scalable Heuristic for Viral Marketing Under the Tipping Model
2013-09-01
Flixster is a social media website that allows users to share reviews and other information about cinema . [35] It was extracted in Dec. 2010. – FourSquare...work of Reichman were developed independently . We also note that Reichman performs no experimental evaluation of the algorithm. A Scalable Heuristic...other dif- fusion models, such as the independent cascade model [21] and evolutionary graph theory [25] as well as probabilistic variants of the
A Scalable Communication Architecture for Advanced Metering Infrastructure
Ngo Hoang , Giang; Liquori , Luigi; Nguyen Chan , Hung
2013-01-01
Advanced Metering Infrastructure (AMI), seen as foundation for overall grid modernization, is an integration of many technologies that provides an intelligent connection between consumers and system operators [ami 2008]. One of the biggest challenge that AMI faces is to scalable collect and manage a huge amount of data from a large number of customers. In our paper, we address this challenge by introducing a mixed peer-to-peer (P2P) and client-server communication architecture for AMI in whic...
Scalable Multi-group Key Management for Advanced Metering Infrastructure
Benmalek , Mourad; Challal , Yacine; Bouabdallah , Abdelmadjid
2015-01-01
International audience; Advanced Metering Infrastructure (AMI) is composed of systems and networks to incorporate changes for modernizing the electricity grid, reduce peak loads, and meet energy efficiency targets. AMI is a privileged target for security attacks with potentially great damage against infrastructures and privacy. For this reason, Key Management has been identified as one of the most challenging topics in AMI development. In this paper, we propose a new Scalable multi-group key ...
Economical and scalable synthesis of 6-amino-2-cyanobenzothiazole
Directory of Open Access Journals (Sweden)
Jacob R. Hauser
2016-09-01
Full Text Available 2-Cyanobenzothiazoles (CBTs are useful building blocks for: 1 luciferin derivatives for bioluminescent imaging; and 2 handles for bioorthogonal ligations. A particularly versatile CBT is 6-amino-2-cyanobenzothiazole (ACBT, which has an amine handle for straight-forward derivatisation. Here we present an economical and scalable synthesis of ACBT based on a cyanation catalysed by 1,4-diazabicyclo[2.2.2]octane (DABCO, and discuss its advantages for scale-up over previously reported routes.
Architectural Techniques to Enable Reliable and Scalable Memory Systems
Nair, Prashant J.
2017-01-01
High capacity and scalable memory systems play a vital role in enabling our desktops, smartphones, and pervasive technologies like Internet of Things (IoT). Unfortunately, memory systems are becoming increasingly prone to faults. This is because we rely on technology scaling to improve memory density, and at small feature sizes, memory cells tend to break easily. Today, memory reliability is seen as the key impediment towards using high-density devices, adopting new technologies, and even bui...
Parallel processing in nuclear applications
International Nuclear Information System (INIS)
Muniz, Francisco Junqueira
1995-01-01
This paper summarizes some investigations on effective and scalable dynamic load-balancing mechanisms suitable for distributed-memory (loosely-coupled) MIMD systems. The selected implementation environment is composed of T800 transputers programed in the occam and C languages and an automatic routing package communication software mechanism (the virtual channel router). Tasks were generated, at execution time, using a multiple-spawning mechanism based on a set of remote procedure calls primitives. The objective is to improve maximum resource utilization. In particular, the investigation described here facilitate portability of the user application, since it concentrates on system-level load balancing mechanisms. The load-balancing mechanisms studies are also suitable for systems that can vary in size, concentrating on methods with potential for scalability. Two possible application examples, chosen from the nuclear area, where distributed-memory MIMD machines can be utilized, are mentioned. (author). 24 refs., 1 fig