WorldWideScience

Sample records for temporal-difference reinforcement learning

  1. Critical factors in the empirical performance of temporal difference and evolutionary methods for reinforcement learning

    NARCIS (Netherlands)

    Whiteson, S.; Taylor, M.E.; Stone, P.

    2010-01-01

    Temporal difference and evolutionary methods are two of the most common approaches to solving reinforcement learning problems. However, there is little consensus on their relative merits and there have been few empirical studies that directly compare their performance. This article aims to address

  2. Value learning through reinforcement : The basics of dopamine and reinforcement learning

    NARCIS (Netherlands)

    Daw, N.D.; Tobler, P.N.; Glimcher, P.W.; Fehr, E.

    2013-01-01

    This chapter provides an overview of reinforcement learning and temporal difference learning and relates these topics to the firing properties of midbrain dopamine neurons. First, we review the Rescorla-Wagner learning rule and basic learning phenomena, such as blocking, which the rule explains. Then
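
    For reference, the Rescorla-Wagner rule and the temporal-difference (TD) error it foreshadows can be written in their standard textbook forms (general notation, not specific to this chapter):

        \Delta V_i = \alpha \beta \bigl(\lambda - \textstyle\sum_j V_j\bigr)

        \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t

    Here \lambda is the maximum associative strength supported by the reinforcer, and \delta_t is the prediction error that phasic dopamine activity is proposed to resemble.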

  3. Temporal Memory Reinforcement Learning for the Autonomous Micro-mobile Robot Based-behavior

    Institute of Scientific and Technical Information of China (English)

    Yang Yujun(杨玉君); Cheng Junshi; Chen Jiapin; Li Xiaohai

    2004-01-01

    This paper presents temporal memory reinforcement learning for a behavior-based autonomous micro-mobile robot. Human memory is subject to oblivion: what is memorized earlier is forgotten earlier, and only what is repeated is remembered firmly. Inspired by this, the robot need not memorize all of its past states, economizing the EMS memory space that is scarce in the MPU of our AMRobot. The proposed algorithm is an extension of Q-learning, an incremental reinforcement learning method. Simulation results show that the algorithm is valid.
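
    For orientation, the incremental Q-learning update that the proposed temporal-memory method extends can be sketched as below. This is a minimal textbook implementation under an assumed environment interface (reset(), step(), actions); the memory-oblivion extension from the paper is not reproduced.

        import random
        from collections import defaultdict

        def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
            """Minimal tabular Q-learning; env is an assumed interface."""
            Q = defaultdict(float)  # (state, action) -> value
            for _ in range(episodes):
                state, done = env.reset(), False
                while not done:
                    # epsilon-greedy action selection
                    if random.random() < epsilon:
                        action = random.choice(env.actions)
                    else:
                        action = max(env.actions, key=lambda a: Q[(state, a)])
                    next_state, reward, done = env.step(action)
                    # incremental temporal-difference update
                    best_next = max(Q[(next_state, a)] for a in env.actions)
                    td_error = reward + gamma * best_next * (not done) - Q[(state, action)]
                    Q[(state, action)] += alpha * td_error
                    state = next_state
            return Q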

  4. TEXPLORE temporal difference reinforcement learning for robots and time-constrained domains

    CERN Document Server

    Hester, Todd

    2013-01-01

    This book presents and develops new reinforcement learning methods that enable fast and robust learning on robots in real-time. Robots have the potential to solve many problems in society, because of their ability to work in dangerous places doing necessary jobs that no one wants or is able to do. One barrier to their widespread deployment is that they are mainly limited to tasks where it is possible to hand-program behaviors for every situation that may be encountered. For robots to meet their potential, they need methods that enable them to learn and adapt to novel situations that they were not programmed for. Reinforcement learning (RL) is a paradigm for learning sequential decision making processes and could solve the problems of learning and adaptation on robots. This book identifies four key challenges that must be addressed for an RL algorithm to be practical for robotic control tasks. These RL for Robotics Challenges are: 1) it must learn in very few samples; 2) it must learn in domains with continuou...

  5. Kernel Temporal Differences for Neural Decoding

    Science.gov (United States)

    Bae, Jihye; Sanchez Giraldo, Luis G.; Pohlmeyer, Eric A.; Francis, Joseph T.; Sanchez, Justin C.; Príncipe, José C.

    2015-01-01

    We study the feasibility and capability of the kernel temporal difference (KTD)(λ) algorithm for neural decoding. KTD(λ) is an online, kernel-based learning algorithm, which has been introduced to estimate value functions in reinforcement learning. This algorithm combines kernel-based representations with the temporal difference approach to learning. One of our key observations is that by using strictly positive definite kernels, the algorithm's convergence can be guaranteed for policy evaluation. The algorithm's nonlinear functional approximation capabilities are shown in both simulations of policy evaluation and neural decoding problems (policy improvement). KTD can handle high-dimensional neural states containing spatial-temporal information at a reasonable computational complexity, allowing real-time applications. When the algorithm seeks a proper mapping between a monkey's neural states and desired positions of a computer cursor or a robot arm, in both open-loop and closed-loop experiments, it can effectively learn the neural-state-to-action mapping. Finally, a visualization of the coadaptation process between the decoder and the subject shows the algorithm's capabilities in reinforcement learning brain-machine interfaces. PMID:25866504
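
    The core of a kernel TD update can be sketched as follows; this is a simplified TD(0) variant with a Gaussian kernel and without the eligibility traces or sparsification used in KTD(λ), so treat the interface and parameter names as assumptions.

        import numpy as np

        def gaussian_kernel(x, y, sigma=1.0):
            return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))

        class KernelTD:
            """Kernel value estimator: V(x) = sum_i alpha_i * k(c_i, x)."""
            def __init__(self, eta=0.1, gamma=0.9, sigma=1.0):
                self.centers, self.alphas = [], []
                self.eta, self.gamma, self.sigma = eta, gamma, sigma

            def value(self, x):
                return sum(a * gaussian_kernel(c, x, self.sigma)
                           for c, a in zip(self.centers, self.alphas))

            def update(self, x, reward, x_next):
                # TD error sets the coefficient of a new kernel center placed at x
                td_error = reward + self.gamma * self.value(x_next) - self.value(x)
                self.centers.append(np.asarray(x, dtype=float))
                self.alphas.append(self.eta * td_error)
                return td_error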

  6. A theoretical analysis of temporal difference learning in the iterated prisoner's dilemma game.

    Science.gov (United States)

    Masuda, Naoki; Ohtsuki, Hisashi

    2009-11-01

    Direct reciprocity is a chief mechanism of mutual cooperation in social dilemmas. Agents cooperate if future interactions with the same opponents are highly likely. Direct reciprocity has been explored mostly by evolutionary game theory based on natural selection. Our daily experience tells us, however, that real social agents, including humans, learn to cooperate from experience. In this paper, we analyze a reinforcement learning model called temporal difference learning and study its performance in the iterated Prisoner's Dilemma game. Temporal difference learning is unique among a variety of learning models in that it inherently aims at increasing future payoffs, not immediate ones. It also has a neural basis. We analytically and numerically show that learners with only two internal states properly learn to cooperate with retaliatory players and to defect against unconditional cooperators and defectors. Four-state learners are more capable of achieving a high payoff against various opponents. Moreover, we numerically show that four-state learners can learn to establish mutual cooperation for sufficiently small learning rates.

  7. Multiagent-Based Simulation of Temporal-Spatial Characteristics of Activity-Travel Patterns Using Interactive Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Min Yang

    2014-01-01

    We propose a multiagent-based reinforcement learning algorithm, in which the interactions between travelers and the environment are considered to simulate temporal-spatial characteristics of activity-travel patterns in a city. Road congestion degree is added to the reinforcement learning algorithm as a medium that passes the influence of one traveler’s decision to others. Meanwhile, the agents used in the algorithm are initialized from typical activity patterns extracted from the travel survey diary data of Shangyu city in China. In the simulation, both macroscopic activity-travel characteristics such as traffic flow spatial-temporal distribution and microscopic characteristics such as activity-travel schedules of each agent are obtained. Comparing the simulation results with the survey data, we find that deviation of the peak-hour traffic flow is less than 5%, while the correlation of the simulated versus survey location choice distribution is over 0.9.

  8. Applying reinforcement learning to the weapon assignment problem in air defence

    CSIR Research Space (South Africa)

    Mouton, H

    2011-12-01

    The techniques investigated in this article were two methods from the machine-learning subfield of reinforcement learning (RL), namely a Monte Carlo (MC) control algorithm with exploring starts (MCES), and an off-policy temporal-difference (TD) learning...

  9. Self-Paced Prioritized Curriculum Learning With Coverage Penalty in Deep Reinforcement Learning.

    Science.gov (United States)

    Ren, Zhipeng; Dong, Daoyi; Li, Huaxiong; Chen, Chunlin

    2018-06-01

    In this paper, a new training paradigm is proposed for deep reinforcement learning using self-paced prioritized curriculum learning with coverage penalty. The proposed deep curriculum reinforcement learning (DCRL) takes full advantage of experience replay by adaptively selecting appropriate transitions from replay memory based on the complexity of each transition. The criteria of complexity in DCRL consist of self-paced priority as well as coverage penalty. The self-paced priority reflects the relationship between the temporal-difference error and the difficulty of the current curriculum, for sample efficiency. The coverage penalty is taken into account for sample diversity. In comparison with the deep Q network (DQN) and prioritized experience replay (PER) methods, the DCRL algorithm is evaluated on Atari 2600 games, and the experimental results show that DCRL outperforms DQN and PER on most of these games. More results further show that the proposed curriculum training paradigm of DCRL is also applicable and effective for other memory-based deep reinforcement learning approaches, such as double DQN and dueling network. All the experimental results demonstrate that DCRL can achieve improved training efficiency and robustness for deep reinforcement learning.
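
    The selection criterion can be pictured schematically as below. The exact functional forms used by DCRL are not reproduced here; this is an assumed illustration that mixes a TD-error-based self-paced priority with a replay-count ("coverage") penalty.

        import numpy as np

        def replay_priority(td_errors, replay_counts, difficulty=1.0, penalty_weight=0.1):
            """Illustrative sampling priorities for curriculum-style replay."""
            td_errors = np.abs(np.asarray(td_errors, dtype=float))
            # self-paced term: highest when |TD error| matches the current difficulty
            self_paced = np.exp(-(td_errors - difficulty) ** 2)
            # coverage penalty: discount transitions that were already replayed often
            coverage = penalty_weight * np.asarray(replay_counts, dtype=float)
            priority = np.maximum(self_paced - coverage, 1e-6)
            return priority / priority.sum()  # probabilities for sampling from replay memory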

  10. Human-level control through deep reinforcement learning

    Science.gov (United States)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A.; Veness, Joel; Bellemare, Marc G.; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K.; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-01

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.
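
    The learning step at the heart of such a deep Q-network can be summarised in a few lines. The sketch below only shows the one-step TD targets and the squared TD loss, with NumPy placeholders for network outputs; the published agent additionally uses experience replay, a periodically updated target network, and convolutional networks over stacked frames.

        import numpy as np

        def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
            """Bootstrapped targets y = r + gamma * max_a' Q_target(s', a').
            next_q_values: array (batch, n_actions) from a frozen target network."""
            max_next = next_q_values.max(axis=1)
            return rewards + gamma * (1.0 - dones) * max_next

        def dqn_loss(q_values, actions, targets):
            """Mean squared TD error on the actions actually taken."""
            chosen = q_values[np.arange(len(actions)), actions]
            return np.mean((targets - chosen) ** 2)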

  11. Human-level control through deep reinforcement learning.

    Science.gov (United States)

    Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Rusu, Andrei A; Veness, Joel; Bellemare, Marc G; Graves, Alex; Riedmiller, Martin; Fidjeland, Andreas K; Ostrovski, Georg; Petersen, Stig; Beattie, Charles; Sadik, Amir; Antonoglou, Ioannis; King, Helen; Kumaran, Dharshan; Wierstra, Daan; Legg, Shane; Hassabis, Demis

    2015-02-26

    The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching real-world complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from high-dimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Q-network agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks.

  12. Self-Play and Using an Expert to Learn to Play Backgammon with Temporal Difference Learning

    NARCIS (Netherlands)

    Wiering, Marco A.

    2010-01-01

    A promising approach to learn to play board games is to use reinforcement learning algorithms that can learn a game position evaluation function. In this paper we examine and compare three different methods for generating training games: 1) Learning by self-play, 2) Learning by playing against an

  13. GA-based fuzzy reinforcement learning for control of a magnetic bearing system.

    Science.gov (United States)

    Lin, C T; Jou, C P

    2000-01-01

    This paper proposes a TD (temporal difference) and GA (genetic algorithm)-based reinforcement (TDGAR) learning method and applies it to the control of a real magnetic bearing system. The TDGAR learning scheme is a new hybrid GA, which integrates the TD prediction method and the GA to perform the reinforcement learning task. The TDGAR learning system is composed of two integrated feedforward networks. One neural network acts as a critic network to guide the learning of the other network (the action network) which determines the outputs (actions) of the TDGAR learning system. The action network can be a normal neural network or a neural fuzzy network. Using the TD prediction method, the critic network can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the action network. The action network uses the GA to adapt itself according to the internal reinforcement signal. The key concept of the TDGAR learning scheme is to formulate the internal reinforcement signal as the fitness function for the GA such that the GA can evaluate the candidate solutions (chromosomes) regularly, even during periods without external feedback from the environment. This enables the GA to proceed to new generations regularly without waiting for the arrival of the external reinforcement signal. This can usually accelerate the GA learning since a reinforcement signal may only be available at a time long after a sequence of actions has occurred in the reinforcement learning problem. The proposed TDGAR learning system has been used to control an active magnetic bearing (AMB) system in practice. A systematic design procedure is developed to achieve successful integration of all the subsystems including magnetic suspension, mechanical structure, and controller training. The results show that the TDGAR learning scheme can successfully find a neural controller or a neural fuzzy controller for a self-designed magnetic bearing system.

  14. Rational and Mechanistic Perspectives on Reinforcement Learning

    Science.gov (United States)

    Chater, Nick

    2009-01-01

    This special issue describes important recent developments in applying reinforcement learning models to capture neural and cognitive function. But reinforcement learning, as a theoretical framework, can apply at two very different levels of description: "mechanistic" and "rational." Reinforcement learning is often viewed in mechanistic terms--as…

  15. Neural correlates of reinforcement learning and social preferences in competitive bidding.

    Science.gov (United States)

    van den Bos, Wouter; Talwar, Arjun; McClure, Samuel M

    2013-01-30

    In competitive social environments, people often deviate from what rational choice theory prescribes, resulting in losses or suboptimal monetary gains. We investigate how competition affects learning and decision-making in a common value auction task. During the experiment, groups of five human participants were simultaneously scanned using MRI while playing the auction task. We first demonstrate that bidding is well characterized by reinforcement learning with biased reward representations dependent on social preferences. Indicative of reinforcement learning, we found that estimated trial-by-trial prediction errors correlated with activity in the striatum and ventromedial prefrontal cortex. Additionally, we found that individual differences in social preferences were related to activity in the temporal-parietal junction and anterior insula. Connectivity analyses suggest that monetary and social value signals are integrated in the ventromedial prefrontal cortex and striatum. Based on these results, we argue for a novel mechanistic account for the integration of reinforcement history and social preferences in competitive decision-making.

  16. Reinforcement Learning State-of-the-Art

    CERN Document Server

    Wiering, Marco

    2012-01-01

    Reinforcement learning encompasses both a science of adaptive behavior of rational beings in uncertain environments and a computational methodology for finding optimal behaviors for challenging problems in control, optimization and adaptive behavior of intelligent agents. As a field, reinforcement learning has progressed tremendously in the past decade. The main goal of this book is to present an up-to-date series of survey articles on the main contemporary sub-fields of reinforcement learning. This includes surveys on partially observable environments, hierarchical task decompositions, relational knowledge representation and predictive state representations. Furthermore, topics such as transfer, evolutionary methods and continuous spaces in reinforcement learning are surveyed. In addition, several chapters review reinforcement learning methods in robotics, in games, and in computational neuroscience. In total seventeen different subfields are presented by mostly young experts in those areas, and together the...

  17. Algorithms for Reinforcement Learning

    CERN Document Server

    Szepesvari, Csaba

    2010-01-01

    Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms'

  18. Learning to trade via direct reinforcement.

    Science.gov (United States)

    Moody, J; Saffell, M

    2001-01-01

    We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision-making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.
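
    The risk-adjusted objective mentioned here, the differential Sharpe ratio, is usually written in terms of exponential moving estimates of the first and second moments of returns (notation follows the common presentation and may differ slightly from the article):

        A_t = A_{t-1} + \eta (R_t - A_{t-1}), \qquad B_t = B_{t-1} + \eta (R_t^2 - B_{t-1})

        D_t = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}

    where R_t is the trading return at time t and \eta sets the adaptation rate; D_t serves as the instantaneous reward signal maximised by the recurrent reinforcement learner.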

  19. An imperfect dopaminergic error signal can drive temporal-difference learning.

    Directory of Open Access Journals (Sweden)

    Wiebke Potjans

    2011-05-01

    An open problem in the field of computational neuroscience is how to link synaptic plasticity to system-level learning. A promising framework in this context is temporal-difference (TD) learning. Experimental evidence that supports the hypothesis that the mammalian brain performs temporal-difference learning includes the resemblance of the phasic activity of the midbrain dopaminergic neurons to the TD error and the discovery that cortico-striatal synaptic plasticity is modulated by dopamine. However, as the phasic dopaminergic signal does not reproduce all the properties of the theoretical TD error, it is unclear whether it is capable of driving behavior adaptation in complex tasks. Here, we present a spiking temporal-difference learning model based on the actor-critic architecture. The model dynamically generates a dopaminergic signal with realistic firing rates and exploits this signal to modulate the plasticity of synapses as a third factor. The predictions of our proposed plasticity dynamics are in good agreement with experimental results with respect to dopamine, pre- and post-synaptic activity. An analytical mapping from the parameters of our proposed plasticity dynamics to those of the classical discrete-time TD algorithm reveals that the biological constraints of the dopaminergic signal entail a modified TD algorithm with self-adapting learning parameters and an adapting offset. We show that the neuronal network is able to learn a task with sparse positive rewards as fast as the corresponding classical discrete-time TD algorithm. However, the performance of the neuronal network is impaired with respect to the traditional algorithm on a task with both positive and negative rewards and breaks down entirely on a task with purely negative rewards. Our model demonstrates that the asymmetry of a realistic dopaminergic signal enables TD learning when learning is driven by positive rewards but not when driven by negative rewards.

  20. The Reinforcement Learning Competition 2014

    OpenAIRE

    Dimitrakakis, Christos; Li, Guangliang; Tziortziotis, Nikoalos

    2014-01-01

    Reinforcement learning is one of the most general problems in artificial intelligence. It has been used to model problems in automated experiment design, control, economics, game playing, scheduling and telecommunications. The aim of the reinforcement learning competition is to encourage the development of very general learning agents for arbitrary reinforcement learning problems and to provide a test-bed for the unbiased evaluation of algorithms.

  1. Reinforcement learning in complementarity game and population dynamics.

    Science.gov (United States)

    Jost, Jürgen; Li, Wei

    2014-02-01

    We systematically test and compare different reinforcement learning schemes in a complementarity game [J. Jost and W. Li, Physica A 345, 245 (2005)] played between members of two populations. More precisely, we study the Roth-Erev, Bush-Mosteller, and SoftMax reinforcement learning schemes. A modified version of Roth-Erev with a power exponent of 1.5, as opposed to 1 in the standard version, performs best. We also compare these reinforcement learning strategies with evolutionary schemes. This gives insight into aspects like the issue of quick adaptation as opposed to systematic exploration or the role of learning rates.

  2. Does temporal discounting explain unhealthy behavior? A systematic review and reinforcement learning perspective

    Science.gov (United States)

    Story, Giles W.; Vlaev, Ivo; Seymour, Ben; Darzi, Ara; Dolan, Raymond J.

    2014-01-01

    The tendency to make unhealthy choices is hypothesized to be related to an individual's temporal discount rate, the theoretical rate at which they devalue delayed rewards. Furthermore, a particular form of temporal discounting, hyperbolic discounting, has been proposed to explain why unhealthy behavior can occur despite healthy intentions. We examine these two hypotheses in turn. We first systematically review studies which investigate whether discount rates can predict unhealthy behavior. These studies reveal that high discount rates for money (and in some instances food or drug rewards) are associated with several unhealthy behaviors and markers of health status, establishing discounting as a promising predictive measure. We secondly examine whether intention-incongruent unhealthy actions are consistent with hyperbolic discounting. We conclude that intention-incongruent actions are often triggered by environmental cues or changes in motivational state, whose effects are not parameterized by hyperbolic discounting. We propose a framework for understanding these state-based effects in terms of the interplay of two distinct reinforcement learning mechanisms: a “model-based” (or goal-directed) system and a “model-free” (or habitual) system. Under this framework, while discounting of delayed health may contribute to the initiation of unhealthy behavior, with repetition, many unhealthy behaviors become habitual; if health goals then change, habitual behavior can still arise in response to environmental cues. We propose that the burgeoning development of computational models of these processes will permit further identification of health decision-making phenotypes. PMID:24659960

  3. Framework for robot skill learning using reinforcement learning

    Science.gov (United States)

    Wei, Yingzi; Zhao, Mingyang

    2003-09-01

    A robot acquiring a skill is a process similar to human skill learning. Reinforcement learning (RL) is an online actor-critic method by which a robot can develop its skills. The reinforcement function is the critical component, since it evaluates actions and guides the learning process. We present an augmented reward function that provides a new way to incorporate prior knowledge and experience into the RL controller, and we carefully consider the difference form of the augmented reward function. The additional reward beyond the conventional reward provides more heuristic information for RL. In this paper, we present a strategy for learning complex skills: an automatic robot-shaping policy decomposes the complex skill into a hierarchical learning process. A new form of value function is introduced to attain smooth motion switching swiftly. We present a formal, but practical, framework for robot skill learning and illustrate with an example the utility of the method for learning skilled robot control online.

  4. The role of multiple neuromodulators in reinforcement learning that is based on competition between eligibility traces

    Directory of Open Access Journals (Sweden)

    Marco A Huertas

    2016-12-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different

  5. The Role of Multiple Neuromodulators in Reinforcement Learning That Is Based on Competition between Eligibility Traces.

    Science.gov (United States)

    Huertas, Marco A; Schwettmann, Sarah E; Shouval, Harel Z

    2016-01-01

    The ability to maximize reward and avoid punishment is essential for animal survival. Reinforcement learning (RL) refers to the algorithms used by biological or artificial systems to learn how to maximize reward or avoid negative outcomes based on past experiences. While RL is also important in machine learning, the types of mechanistic constraints encountered by biological machinery might be different than those for artificial systems. Two major problems encountered by RL are how to relate a stimulus with a reinforcing signal that is delayed in time (temporal credit assignment), and how to stop learning once the target behaviors are attained (stopping rule). To address the first problem, synaptic eligibility traces were introduced, bridging the temporal gap between a stimulus and its reward. Although these were mere theoretical constructs, recent experiments have provided evidence of their existence. These experiments also reveal that the presence of specific neuromodulators converts the traces into changes in synaptic efficacy. A mechanistic implementation of the stopping rule usually assumes the inhibition of the reward nucleus; however, recent experimental results have shown that learning terminates at the appropriate network state even in setups where the reward nucleus cannot be inhibited. In an effort to describe a learning rule that solves the temporal credit assignment problem and implements a biologically plausible stopping rule, we proposed a model based on two separate synaptic eligibility traces, one for long-term potentiation (LTP) and one for long-term depression (LTD), each obeying different dynamics and having different effective magnitudes. The model has been shown to successfully generate stable learning in recurrent networks. Although the model assumes the presence of a single neuromodulator, evidence indicates that there are different neuromodulators for expressing the different traces. What could be the role of different neuromodulators for
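
    For context, the classical single-trace TD(λ) construction that such models generalise keeps one exponentially decaying eligibility trace per synapse (or feature) and gates weight changes by the TD error; the two-trace LTP/LTD model above replaces this with separate dynamics and magnitudes per trace. In standard notation:

        e_t = \gamma \lambda\, e_{t-1} + x_t, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad w \leftarrow w + \alpha\, \delta_t\, e_t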

  6. Neural Basis of Reinforcement Learning and Decision Making

    Science.gov (United States)

    Lee, Daeyeol; Seo, Hyojung; Jung, Min Whan

    2012-01-01

    Reinforcement learning is an adaptive process in which an animal utilizes its previous experience to improve the outcomes of future choices. Computational theories of reinforcement learning play a central role in the newly emerging areas of neuroeconomics and decision neuroscience. In this framework, actions are chosen according to their value functions, which describe how much future reward is expected from each action. Value functions can be adjusted not only through reward and penalty, but also by the animal’s knowledge of its current environment. Studies have revealed that a large proportion of the brain is involved in representing and updating value functions and using them to choose an action. However, how the nature of a behavioral task affects the neural mechanisms of reinforcement learning remains incompletely understood. Future studies should uncover the principles by which different computational elements of reinforcement learning are dynamically coordinated across the entire brain. PMID:22462543

  7. Ensemble Network Architecture for Deep Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Xi-liang Chen

    2018-01-01

    The popular deep Q-learning algorithm is known to be unstable because of oscillation and overestimation of action values under certain conditions, and these issues tend to adversely affect performance. In this paper, we develop an ensemble network architecture for deep reinforcement learning based on value function approximation. The temporal ensemble stabilizes the training process by reducing the variance of the target approximation error, and the ensemble of target values reduces overestimation and yields better performance by estimating more accurate Q-values. Our results show that this architecture leads to statistically significant improvements in value evaluation and to more stable and better performance on several classical control tasks in the OpenAI Gym environment.
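
    One way to picture the "temporal ensemble" of target values is to average the bootstrap term over several recent snapshots of the target network, which lowers the variance of the TD target. The sketch below is an assumed simplification of that idea, not the paper's exact architecture.

        import numpy as np

        def ensemble_td_target(reward, next_state_q_snapshots, gamma=0.99, done=False):
            """Average the max-Q bootstrap over K target-network snapshots.
            next_state_q_snapshots: array (K, n_actions) with Q(s', .) per snapshot."""
            max_q_per_snapshot = np.max(next_state_q_snapshots, axis=1)
            averaged_bootstrap = np.mean(max_q_per_snapshot)  # reduces target variance
            return reward + gamma * (0.0 if done else averaged_bootstrap)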

  8. Spatial and temporal relations in conditioned reinforcement and observing behavior.

    Science.gov (United States)

    Bowe, C A; Dinsmoor, J A

    1983-03-01

    In Experiment 1, depressing one perch produced stimuli indicating which of two keys, if pecked, could produce food (spatial information) and depressing the other perch produced stimuli indicating whether a variable-interval or an extinction schedule was operating (temporal information). The pigeons increased the time they spent depressing the perch that produced the temporal information but did not increase the time they spent depressing the perch that produced the spatial information. In Experiment 2, pigeons that were allowed to produce combined spatial and temporal information did not acquire the perch pressing any faster or maintain it at a higher level than pigeons allowed to produce only temporal information. Later, when perching produced only spatial information, the time spent depressing the perch eventually declined. The results are not those implied by the statement that information concerning biologically important events is reinforcing but are consistent with an interpretation in terms of the acquisition of reinforcing properties by a stimulus associated with a higher density of primary reinforcement.

  9. Deep Reinforcement Learning: An Overview

    OpenAIRE

    Li, Yuxi

    2017-01-01

    We give an overview of recent exciting achievements of deep reinforcement learning (RL). We discuss six core elements, six important mechanisms, and twelve applications. We start with background of machine learning, deep learning and reinforcement learning. Next we discuss core RL elements, including value function, in particular, Deep Q-Network (DQN), policy, reward, model, planning, and exploration. After that, we discuss important mechanisms for RL, including attention and memory, unsuperv...

  10. Can model-free reinforcement learning explain deontological moral judgments?

    Science.gov (United States)

    Ayars, Alisabeth

    2016-05-01

    Dual-systems frameworks propose that moral judgments are derived from both an immediate emotional response, and controlled/rational cognition. Recently Cushman (2013) proposed a new dual-system theory based on model-free and model-based reinforcement learning. Model-free learning attaches values to actions based on their history of reward and punishment, and explains some deontological, non-utilitarian judgments. Model-based learning involves the construction of a causal model of the world and allows for far-sighted planning; this form of learning fits well with utilitarian considerations that seek to maximize certain kinds of outcomes. I present three concerns regarding the use of model-free reinforcement learning to explain deontological moral judgment. First, many actions that humans find aversive from model-free learning are not judged to be morally wrong. Moral judgment must require something in addition to model-free learning. Second, there is a dearth of evidence for central predictions of the reinforcement account-e.g., that people with different reinforcement histories will, all else equal, make different moral judgments. Finally, to account for the effect of intention within the framework requires certain assumptions which lack support. These challenges are reasonable foci for future empirical/theoretical work on the model-free/model-based framework. Copyright © 2016 Elsevier B.V. All rights reserved.

  11. Reinforcement Learning in Autism Spectrum Disorder

    Directory of Open Access Journals (Sweden)

    Manuela Schuetze

    2017-11-01

    Early behavioral interventions are recognized as integral to standard care in autism spectrum disorder (ASD), and often focus on reinforcing desired behaviors (e.g., eye contact) and reducing the presence of atypical behaviors (e.g., echoing others' phrases). However, efficacy of these programs is mixed. Reinforcement learning relies on neurocircuitry that has been reported to be atypical in ASD: prefrontal-sub-cortical circuits, amygdala, brainstem, and cerebellum. Thus, early behavioral interventions rely on neurocircuitry that may function atypically in at least a subset of individuals with ASD. Recent work has investigated physiological, behavioral, and neural responses to reinforcers to uncover differences in motivation and learning in ASD. We will synthesize this work to identify promising avenues for future research that ultimately can be used to enhance the efficacy of early intervention.

  12. Reinforcement learning in supply chains.

    Science.gov (United States)

    Valluri, Annapurna; North, Michael J; Macal, Charles M

    2009-10-01

    Effective management of supply chains creates value and can strategically position companies. In practice, human beings have been found to be both surprisingly successful and disappointingly inept at managing supply chains. The related fields of cognitive psychology and artificial intelligence have postulated a variety of potential mechanisms to explain this behavior. One of the leading candidates is reinforcement learning. This paper applies agent-based modeling to investigate the comparative behavioral consequences of three simple reinforcement learning algorithms in a multi-stage supply chain. For the first time, our findings show that the specific algorithm that is employed can have dramatic effects on the results obtained. Reinforcement learning is found to be valuable in multi-stage supply chains with several learning agents, as independent agents can learn to coordinate their behavior. However, learning in multi-stage supply chains using these postulated approaches from cognitive psychology and artificial intelligence takes extremely long time periods to achieve stability, which raises questions about their ability to explain behavior in real supply chains. The fact that it takes thousands of periods for agents to learn in this simple multi-agent setting provides new evidence that real-world decision makers are unlikely to be using strict reinforcement learning in practice.

  13. Effect of reinforcement learning on coordination of multiagent systems

    Science.gov (United States)

    Bukkapatnam, Satish T. S.; Gao, Greg

    2000-12-01

    For effective coordination of distributed environments involving multiagent systems, the learning ability of each agent in the environment plays a crucial role. In this paper, we develop a simple group learning method based on reinforcement, and study its effect on coordination through application to a supply chain procurement scenario involving a computer manufacturer. Here, all parties are represented by self-interested, autonomous agents, each capable of performing specific simple tasks. They negotiate with each other to perform complex tasks and thus coordinate supply chain procurement. Reinforcement learning is intended to enable each agent to reach the best negotiable price within the shortest possible time. Our simulations of the application scenario under different learning strategies reveal the positive effects of reinforcement learning on an agent's as well as the system's performance.

  14. Reinforcement learning in computer vision

    Science.gov (United States)

    Bernstein, A. V.; Burnaev, E. V.

    2018-04-01

    Nowadays, machine learning has become one of the basic technologies used in solving various computer vision tasks such as feature detection, image segmentation, object recognition and tracking. In many applications, complex systems such as robots are equipped with visual sensors from which they learn the state of the surrounding environment by solving corresponding computer vision tasks. Solutions of these tasks are used for making decisions about possible future actions. It is therefore not surprising that, when solving computer vision tasks, we should take into account special aspects of their subsequent application in model-based predictive control. Reinforcement learning is one of the modern machine learning technologies in which learning is carried out through interaction with the environment. In recent years, reinforcement learning has been used both for solving applied tasks such as processing and analysis of visual information, and for solving specific computer vision problems such as filtering, extracting image features, localizing objects in scenes, and many others. The paper briefly describes reinforcement learning technology and its use for solving computer vision problems.

  15. A Reinforcement Learning Framework for Spiking Networks with Dynamic Synapses

    Directory of Open Access Journals (Sweden)

    Karim El-Laithy

    2011-01-01

    An integration of both Hebbian-based and reinforcement learning (RL) rules is presented for dynamic synapses. The proposed framework permits the Hebbian rule to update the hidden synaptic model parameters regulating the synaptic response rather than the synaptic weights. This is performed using both the value and the sign of the temporal difference in the reward signal after each trial. Applying this framework, a spiking network with spike-timing-dependent synapses is tested to learn the exclusive-OR computation on a temporally coded basis. Reward values are calculated from the distance between the network's output spike train and a reference target spike train. Results show that the network is able to capture the required dynamics and that the proposed framework indeed realizes an integrated version of Hebbian and RL rules. The proposed framework is tractable and less computationally expensive. The framework is applicable to a wide class of synaptic models and is not restricted to the used neural representation. This generality, along with the reported results, supports adopting the introduced approach to benefit from biologically plausible synaptic models in a wide range of intuitive signal processing.

  16. Modeling the violation of reward maximization and invariance in reinforcement schedules.

    Directory of Open Access Journals (Sweden)

    Giancarlo La Camera

    2008-08-01

    It is often assumed that animals and people adjust their behavior to maximize reward acquisition. In visually cued reinforcement schedules, monkeys make errors in trials that are not immediately rewarded, despite having to repeat error trials. Here we show that error rates are typically smaller in trials equally distant from reward but belonging to longer schedules (referred to as the "schedule length effect"). This violates the principles of reward maximization and invariance and cannot be predicted by the standard methods of Reinforcement Learning, such as the method of temporal differences. We develop a heuristic model that accounts for all of the properties of the behavior in the reinforcement schedule task but whose predictions are not different from those of the standard temporal difference model in choice tasks. In the modification of temporal difference learning introduced here, the effect of schedule length emerges spontaneously from the sensitivity to the immediately preceding trial. We also introduce a policy for general Markov Decision Processes, where the decision made at each node is conditioned on the motivation to perform an instrumental action, and show that the application of our model to the reinforcement schedule task and the choice task are special cases of this general theoretical framework. Within this framework, Reinforcement Learning can approach contextual learning with the mixture of empirical findings and principled assumptions that seem to coexist in the best descriptions of animal behavior. As examples, we discuss two phenomena observed in humans that often derive from the violation of the principle of invariance: "framing," wherein equivalent options are treated differently depending on the context in which they are presented, and the "sunk cost" effect, the greater tendency to continue an endeavor once an investment in money, effort, or time has been made. The schedule length effect might be a manifestation of these

  17. Reinforcement learning for microgrid energy management

    International Nuclear Information System (INIS)

    Kuznetsova, Elizaveta; Li, Yan-Fu; Ruiz, Carlos; Zio, Enrico; Ault, Graham; Bell, Keith

    2013-01-01

    We consider a microgrid for energy distribution, with a local consumer, a renewable generator (wind turbine) and a storage facility (battery), connected to the external grid via a transformer. We propose a 2 steps-ahead reinforcement learning algorithm to plan the battery scheduling, which plays a key role in the achievement of the consumer goals. The underlying framework is one of multi-criteria decision-making by an individual consumer who has the goals of increasing the utilization rate of the battery during high electricity demand (so as to decrease the electricity purchase from the external grid) and increasing the utilization rate of the wind turbine for local use (so as to increase the consumer independence from the external grid). Predictions of available wind power feed the reinforcement learning algorithm for selecting the optimal battery scheduling actions. The embedded learning mechanism allows the consumer's knowledge about the optimal battery scheduling actions under different time-dependent environmental conditions to be enhanced. The developed framework gives intelligent consumers the capability to learn the stochastic environment and make use of that experience to select optimal energy management actions. - Highlights: • A consumer exploits a 2 steps-ahead reinforcement learning for battery scheduling. • The Q-learning based mechanism is fed by the predictions of available wind power. • Wind speed state evolutions are modeled with a Markov chain model. • Optimal scheduling actions are learned through the occurrence of similar scenarios. • The consumer manifests a continuous enhancement of his knowledge about optimal actions

  18. Social Cognition as Reinforcement Learning: Feedback Modulates Emotion Inference.

    Science.gov (United States)

    Zaki, Jamil; Kallman, Seth; Wimmer, G Elliott; Ochsner, Kevin; Shohamy, Daphna

    2016-09-01

    Neuroscientific studies of social cognition typically employ paradigms in which perceivers draw single-shot inferences about the internal states of strangers. Real-world social inference features much different parameters: People often encounter and learn about particular social targets (e.g., friends) over time and receive feedback about whether their inferences are correct or incorrect. Here, we examined this process and, more broadly, the intersection between social cognition and reinforcement learning. Perceivers were scanned using fMRI while repeatedly encountering three social targets who produced conflicting visual and verbal emotional cues. Perceivers guessed how targets felt and received feedback about whether they had guessed correctly. Visual cues reliably predicted one target's emotion, verbal cues predicted a second target's emotion, and neither reliably predicted the third target's emotion. Perceivers successfully used this information to update their judgments over time. Furthermore, trial-by-trial learning signals-estimated using two reinforcement learning models-tracked activity in ventral striatum and ventromedial pFC, structures associated with reinforcement learning, and regions associated with updating social impressions, including TPJ. These data suggest that learning about others' emotions, like other forms of feedback learning, relies on domain-general reinforcement mechanisms as well as domain-specific social information processing.

  19. A multiplicative reinforcement learning model capturing learning dynamics and interindividual variability in mice

    OpenAIRE

    Bathellier, Brice; Tee, Sui Poh; Hrovat, Christina; Rumpel, Simon

    2013-01-01

    Learning speed can strongly differ across individuals. This is seen in humans and animals. Here, we measured learning speed in mice performing a discrimination task and developed a theoretical model based on the reinforcement learning framework to account for differences between individual mice. We found that, when using a multiplicative learning rule, the starting connectivity values of the model strongly determine the shape of learning curves. This is in contrast to current learning models ...

  20. Reinforcement learning agents providing advice in complex video games

    Science.gov (United States)

    Taylor, Matthew E.; Carboni, Nicholas; Fachantidis, Anestis; Vlahavas, Ioannis; Torrey, Lisa

    2014-01-01

    This article introduces a teacher-student framework for reinforcement learning, synthesising and extending material that appeared in conference proceedings [Torrey, L., & Taylor, M. E. (2013). Teaching on a budget: Agents advising agents in reinforcement learning. Proceedings of the international conference on autonomous agents and multiagent systems] and in a non-archival workshop paper [Carboni, N., & Taylor, M. E. (2013, May). Preliminary results for 1 vs. 1 tactics in StarCraft. Proceedings of the adaptive and learning agents workshop (at AAMAS-13)]. In this framework, a teacher agent instructs a student agent by suggesting actions the student should take as it learns. However, the teacher may only give such advice a limited number of times. We present several novel algorithms that teachers can use to budget their advice effectively, and we evaluate them in two complex video games: StarCraft and Pac-Man. Our results show that the same amount of advice, given at different moments, can have different effects on student learning, and that teachers can significantly affect student learning even when students use different learning methods and state representations.

  1. Instructional control of reinforcement learning: a behavioral and neurocomputational investigation.

    Science.gov (United States)

    Doll, Bradley B; Jacobs, W Jake; Sanfey, Alan G; Frank, Michael J

    2009-11-24

    Humans learn how to behave directly through environmental experience and indirectly through rules and instructions. Behavior analytic research has shown that instructions can control behavior, even when such behavior leads to sub-optimal outcomes (Hayes, S. (Ed.). 1989. Rule-governed behavior: cognition, contingencies, and instructional control. Plenum Press.). Here we examine the control of behavior through instructions in a reinforcement learning task known to depend on striatal dopaminergic function. Participants selected between probabilistically reinforced stimuli, and were (incorrectly) told that a specific stimulus had the highest (or lowest) reinforcement probability. Despite experience to the contrary, instructions drove choice behavior. We present neural network simulations that capture the interactions between instruction-driven and reinforcement-driven behavior via two potential neural circuits: one in which the striatum is inaccurately trained by instruction representations coming from prefrontal cortex/hippocampus (PFC/HC), and another in which the striatum learns the environmentally based reinforcement contingencies, but is "overridden" at decision output. Both models capture the core behavioral phenomena but, because they differ fundamentally on what is learned, make distinct predictions for subsequent behavioral and neuroimaging experiments. Finally, we attempt to distinguish between the proposed computational mechanisms governing instructed behavior by fitting a series of abstract "Q-learning" and Bayesian models to subject data. The best-fitting model supports one of the neural models, suggesting the existence of a "confirmation bias" in which the PFC/HC system trains the reinforcement system by amplifying outcomes that are consistent with instructions while diminishing inconsistent outcomes.
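
    The "confirmation bias" mechanism described above can be caricatured with a small modification to a simple delta-rule value update, in which outcomes consistent with the instruction are learned from more strongly and inconsistent outcomes are discounted. This is an illustrative sketch, not the authors' fitted model; the parameter names are assumptions.

        def instructed_update(q, action, reward, instructed_action,
                              alpha=0.2, boost=1.5, damp=0.5):
            """Delta-rule update with instruction-dependent distortion of learning."""
            prediction_error = reward - q[action]
            # outcome is "consistent" if it supports the (possibly false) instruction
            consistent = ((action == instructed_action and reward > 0) or
                          (action != instructed_action and reward <= 0))
            effective_alpha = alpha * (boost if consistent else damp)
            q[action] += effective_alpha * prediction_error
            return q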

  2. The Computational Development of Reinforcement Learning during Adolescence.

    Directory of Open Access Journals (Sweden)

    Stefano Palminteri

    2016-06-01

    Adolescence is a period of life characterised by changes in learning and decision-making. Learning and decision-making do not rely on a unitary system, but instead require the coordination of different cognitive processes that can be mathematically formalised as dissociable computational modules. Here, we aimed to trace the developmental time-course of the computational modules responsible for learning from reward or punishment, and learning from counterfactual feedback. Adolescents and adults carried out a novel reinforcement learning paradigm in which participants learned the association between cues and probabilistic outcomes, where the outcomes differed in valence (reward versus punishment) and feedback was either partial or complete (either the outcome of the chosen option only, or the outcomes of both the chosen and unchosen option, were displayed). Computational strategies changed during development: whereas adolescents' behaviour was better explained by a basic reinforcement learning algorithm, adults' behaviour integrated increasingly complex computational features, namely a counterfactual learning module (enabling enhanced performance in the presence of complete feedback) and a value contextualisation module (enabling symmetrical reward and punishment learning). Unlike adults, adolescent performance did not benefit from counterfactual (complete) feedback. In addition, while adults learned symmetrically from both reward and punishment, adolescents learned from reward but were less likely to learn from punishment. This tendency to rely on rewards and not to consider alternative consequences of actions might contribute to our understanding of decision-making in adolescence.

  3. SCAFFOLDING AND REINFORCEMENT: USING DIGITAL LOGBOOKS IN LEARNING VOCABULARY

    OpenAIRE

    Khalifa, Salma Hasan Almabrouk; Shabdin, Ahmad Affendi

    2016-01-01

    Reinforcement and scaffolding are tested approaches to enhance learning achievements. Keeping a record of the learning process as well as the newly learned words functions as scaffolding to help learners build a comprehensive vocabulary. Similarly, repetitive learning of new words reinforces permanent learning for long-term memory. Paper-based logbooks may prove to be good records of the learning process, but if learners use digital logbooks, the results may be even better. Digital logbooks wit...

  4. Episodic reinforcement learning control approach for biped walking

    Directory of Open Access Journals (Sweden)

    Katić Duško

    2012-01-01

    Full Text Available This paper presents a hybrid dynamic control approach to the realization of humanoid biped robotic walk, focusing on policy gradient episodic reinforcement learning with fuzzy evaluative feedback. The proposed controller structure involves two feedback loops: a conventional computed torque controller and an episodic reinforcement learning controller. The reinforcement learning part includes fuzzy information about Zero-Moment-Point errors. Simulation tests using a medium-size 36-DOF humanoid robot MEXONE were performed to demonstrate the effectiveness of our method.

  5. Reinforcement learning using a continuous time actor-critic framework with spiking neurons.

    Directory of Open Access Journals (Sweden)

    Nicolas Frémaux

    2013-04-01

    Full Text Available Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
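
    A minimal discretized sketch of the actor-critic scheme described above, assuming a linear critic over a radial-basis-function state representation and a Gaussian-exploration actor: the critic's TD (reward-prediction) error conditions both its own update and the actor's. All names, features, and constants are illustrative assumptions, not the spiking-network implementation of the paper.

        import numpy as np

        def rbf_features(state, centers, width=0.5):
            """Radial-basis-function state representation (an assumed stand-in for place cells)."""
            return np.exp(-np.sum((centers - state) ** 2, axis=1) / (2 * width ** 2))

        def td_actor_critic_step(w_critic, w_actor, phi, phi_next, reward, action_noise,
                                 gamma=0.98, alpha_v=0.1, alpha_pi=0.05):
            """One discretized TD update for a linear critic and a Gaussian-policy actor."""
            delta = reward + gamma * (w_critic @ phi_next) - (w_critic @ phi)  # TD error
            w_critic += alpha_v * delta * phi                                  # critic learns to predict reward
            w_actor += alpha_pi * delta * np.outer(action_noise, phi)          # the same TD signal reinforces the actor
            return delta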

  6. Reinforcement Learning Explains Conditional Cooperation and Its Moody Cousin.

    Directory of Open Access Journals (Sweden)

    Takahiro Ezaki

    2016-07-01

    Full Text Available Direct reciprocity, or repeated interaction, is a main mechanism to sustain cooperation under social dilemmas involving two individuals. For larger groups and networks, which are probably more relevant to understanding and engineering our society, experiments employing repeated multiplayer social dilemma games have suggested that humans often show conditional cooperation behavior and its moody variant. Mechanisms underlying these behaviors largely remain unclear. Here we provide a proximate account for this behavior by showing that individuals adopting a type of reinforcement learning, called aspiration learning, phenomenologically behave as conditional cooperators. By definition, individuals are satisfied if and only if the obtained payoff is larger than a fixed aspiration level. They reinforce actions that have resulted in satisfactory outcomes and anti-reinforce those yielding unsatisfactory outcomes. The results obtained in the present study are general in that they explain extant experimental results obtained for both so-called moody and non-moody conditional cooperation, prisoner's dilemma and public goods games, and well-mixed groups and networks. In contrast to previous theory, individuals are assumed to have no access to information about what other individuals are doing, so they cannot explicitly use conditional cooperation rules. In this sense, myopic aspiration learning, in which the unconditional propensity of cooperation is modulated at every discrete time step, explains the conditional behavior of humans. Aspiration learners showing (moody) conditional cooperation obeyed a noisy GRIM-like strategy. This is different from Pavlov, a reinforcement learning strategy promoting mutual cooperation in two-player situations.
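
    A minimal sketch of the aspiration-learning rule summarized above, with assumed parameter values: an agent raises its propensity to repeat an action whose payoff met a fixed aspiration level and lowers it otherwise. The function name and constants are illustrative, not the paper's fitted model.

        def aspiration_update(p_cooperate, action, payoff, aspiration=1.0, rate=0.2):
            """Reinforce (or anti-reinforce) the propensity to cooperate after one round."""
            satisfied = payoff >= aspiration
            # Repeat a satisfying action, move away from an unsatisfying one.
            target = 1.0 if (action == "C") == satisfied else 0.0
            return p_cooperate + rate * (target - p_cooperate)

        # Example: cooperating for a payoff of 3 against an aspiration of 1 makes cooperation more likely.
        p_new = aspiration_update(p_cooperate=0.5, action="C", payoff=3.0)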

  7. Belief reward shaping in reinforcement learning

    CSIR Research Space (South Africa)

    Marom, O

    2018-02-01

    Full Text Available A key challenge in many reinforcement learning problems is delayed rewards, which can significantly slow down learning. Although reward shaping has previously been introduced to accelerate learning by bootstrapping an agent with additional...

  8. Adaptive representations for reinforcement learning

    NARCIS (Netherlands)

    Whiteson, S.

    2010-01-01

    This book presents new algorithms for reinforcement learning, a form of machine learning in which an autonomous agent seeks a control policy for a sequential decision task. Since current methods typically rely on manually designed solution representations, agents that automatically adapt their own

  9. Punishment Insensitivity and Impaired Reinforcement Learning in Preschoolers

    Science.gov (United States)

    Briggs-Gowan, Margaret J.; Nichols, Sara R.; Voss, Joel; Zobel, Elvira; Carter, Alice S.; McCarthy, Kimberly J.; Pine, Daniel S.; Blair, James; Wakschlag, Lauren S.

    2014-01-01

    Background: Youth and adults with psychopathic traits display disrupted reinforcement learning. Advances in measurement now enable examination of this association in preschoolers. The current study examines relations between reinforcement learning in preschoolers and parent ratings of reduced responsiveness to socialization, conceptualized as a…

  10. Reinforcement learning in continuous state and action spaces

    NARCIS (Netherlands)

    H. P. van Hasselt (Hado); M.A. Wiering; M. van Otterlo

    2012-01-01

    Many traditional reinforcement-learning algorithms have been designed for problems with small finite state and action spaces. Learning in such discrete problems can be difficult, due to noise and delayed reinforcements. However, many real-world problems have continuous state or action

  11. Autonomous reinforcement learning with experience replay.

    Science.gov (United States)

    Wawrzyński, Paweł; Tanwani, Ajay Kumar

    2013-05-01

    This paper considers the issues of efficiency and autonomy that are required to make reinforcement learning suitable for real-life control tasks. A real-time reinforcement learning algorithm is presented that repeatedly adjusts the control policy with the use of previously collected samples, and autonomously estimates the appropriate step-sizes for the learning updates. The algorithm is based on the actor-critic with experience replay whose step-sizes are determined on-line by an enhanced fixed point algorithm for on-line neural network training. An experimental study with a simulated octopus arm and half-cheetah demonstrates the feasibility of the proposed algorithm to solve difficult learning control problems in an autonomous way within a reasonably short time. Copyright © 2012 Elsevier Ltd. All rights reserved.

  12. Reinforcement Learning in Repeated Portfolio Decisions

    OpenAIRE

    Diao, Linan; Rieskamp, Jörg

    2011-01-01

    How do people make investment decisions when they receive outcome feedback? We examined how well the standard mean-variance model and two reinforcement models predict people's portfolio decisions. The basic reinforcement model predicts a learning process that relies solely on the portfolio's overall return, whereas the proposed extended reinforcement model also takes the risk and covariance of the investments into account. The experimental results illustrate that people reacted sensitively to...

  13. Reinforcement learning improves behaviour from evaluative feedback

    Science.gov (United States)

    Littman, Michael L.

    2015-05-01

    Reinforcement learning is a branch of machine learning concerned with using experience gained through interacting with the world and evaluative feedback to improve a system's ability to make behavioural decisions. It has been called the artificial intelligence problem in a microcosm because learning algorithms must act autonomously to perform well and achieve their goals. Partly driven by the increasing availability of rich data, recent years have seen exciting advances in the theory and practice of reinforcement learning, including developments in fundamental technical areas such as generalization, planning, exploration and empirical methodology, leading to increasing applicability to real-life problems.

  14. Solution to reinforcement learning problems with artificial potential field

    Institute of Scientific and Technical Information of China (English)

    XIE Li-juan; XIE Guang-rong; CHEN Huan-wen; LI Xiao-li

    2008-01-01

    A novel method was designed to solve reinforcement learning problems with an artificial potential field. First, a reinforcement learning problem was transformed into a path planning problem by using an artificial potential field (APF), a very appropriate way to model a reinforcement learning problem. Second, a new APF algorithm with a virtual water-flow concept was proposed to overcome the local minimum problem of potential field methods. The performance of the new method was tested on a gridworld problem known as the key-and-door maze. The experimental results show that within 45 trials, good and deterministic policies are found in almost all simulations. In comparison with Wiering's HQ-learning system, which needs 20,000 trials for a stable solution, the proposed method obtains an optimal and stable policy far more quickly. The new method is therefore simple and effective for finding an optimal solution to the reinforcement learning problem.

  15. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning.

    Science.gov (United States)

    Jones, Rebecca M; Somerville, Leah H; Li, Jian; Ruberry, Erika J; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, B J

    2014-06-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The present study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated. Seventy-eight of these participants completed the task during fMRI scanning. Modeling trial-by-trial learning, children and adults showed higher positive learning rates than did adolescents, suggesting that adolescents demonstrated less differentiation in their reaction times for peers who provided more positive feedback. Forming expectations about receiving positive social reinforcement correlated with neural activity within the medial prefrontal cortex and ventral striatum across age. Adolescents, unlike children and adults, showed greater insular activity during positive prediction error learning and increased activity in the supplementary motor cortex and the putamen when receiving positive social feedback regardless of the expected outcome, suggesting that peer approval may motivate adolescents toward action. While different amounts of positive social reinforcement enhanced learning in children and adults, all positive social reinforcement equally motivated adolescents. Together, these findings indicate that sensitivity to peer approval during adolescence goes beyond simple reinforcement theory accounts and suggest possible explanations for how peers may motivate adolescent behavior.

  16. A reward optimization method based on action subrewards in hierarchical reinforcement learning.

    Science.gov (United States)

    Fu, Yuchen; Liu, Quan; Ling, Xionghong; Cui, Zhiming

    2014-01-01

    Reinforcement learning (RL) is a kind of interactive learning method whose main characteristics are "trial and error" and "related reward." A hierarchical reinforcement learning method based on action subrewards is proposed to address the "curse of dimensionality," in which the state space grows exponentially with the number of features and convergence is slow. The method greatly reduces the state space and chooses actions purposefully and efficiently, so as to optimize the reward function and speed up convergence. Applied to online learning in the game of Tetris, the experimental results show that convergence speed is evidently enhanced by the new method, which combines a hierarchical reinforcement learning algorithm with action subrewards. The "curse of dimensionality" problem is also solved to a certain extent by the hierarchical method. Performance under different parameters is compared and analyzed as well.

  17. Characterizing Reinforcement Learning Methods through Parameterized Learning Problems

    Science.gov (United States)

    2011-06-03

    extraneous. The agent could potentially adapt these representational aspects by applying methods from feature selection (Kolter and Ng, 2009, Regularization and feature selection in least-squares temporal difference learning; Petrik et al.).

  18. Deep reinforcement learning for automated radiation adaptation in lung cancer.

    Science.gov (United States)

    Tseng, Huan-Hsin; Luo, Yi; Cui, Sunan; Chien, Jen-Tzung; Ten Haken, Randall K; Naqa, Issam El

    2017-12-01

    escalation/de-escalation between 1.5 and 3.8 Gy, a range similar to that used in the clinical protocol. The same DQN yielded two patterns of dose escalation for the 34 test patients, but with different reward variants. First, using the baseline P+ reward function, individual adaptive fraction doses of the DQN had similar tendencies to the clinical data with an RMSE = 0.76 Gy; but adaptations suggested by the DQN were generally lower in magnitude (less aggressive). Second, by adjusting the P+ reward function with higher emphasis on mitigating local failure, better matching of doses between the DQN and the clinical protocol was achieved with an RMSE = 0.5 Gy. Moreover, the decisions selected by the DQN seemed to have better concordance with patients' eventual outcomes. In comparison, the traditional temporal difference (TD) algorithm for reinforcement learning yielded an RMSE = 3.3 Gy due to numerical instabilities and lack of sufficient learning. We demonstrated that automated dose adaptation by DRL is a feasible and promising approach for achieving similar results to those chosen by clinicians. The process may require customization of the reward function if individual cases were to be considered. However, development of this framework into a fully credible autonomous system for clinical decision support would require further validation on larger multi-institutional datasets. © 2017 American Association of Physicists in Medicine.

  19. Reinforcement and inference in cross-situational word learning.

    Science.gov (United States)

    Tilles, Paulo F C; Fontanari, José F

    2013-01-01

    Cross-situational word learning is based on the notion that a learner can determine the referent of a word by finding something in common across many observed uses of that word. Here we propose an adaptive learning algorithm that contains a parameter that controls the strength of the reinforcement applied to associations between concurrent words and referents, and a parameter that regulates inference, which includes built-in biases, such as mutual exclusivity, and information of past learning events. By adjusting these parameters so that the model predictions agree with data from representative experiments on cross-situational word learning, we were able to explain the learning strategies adopted by the participants of those experiments in terms of a trade-off between reinforcement and inference. These strategies can vary wildly depending on the conditions of the experiments. For instance, for fast mapping experiments (i.e., the correct referent could, in principle, be inferred in a single observation) inference is prevalent, whereas for segregated contextual diversity experiments (i.e., the referents are separated in groups and are exhibited with members of their groups only) reinforcement is predominant. Other experiments are explained with more balanced doses of reinforcement and inference.

  20. The combination of appetitive and aversive reinforcers and the nature of their interaction during auditory learning.

    Science.gov (United States)

    Ilango, A; Wetzel, W; Scheich, H; Ohl, F W

    2010-03-31

    Learned changes in behavior can be elicited by either appetitive or aversive reinforcers. It is, however, not clear whether the two types of motivation, (approaching appetitive stimuli and avoiding aversive stimuli) drive learning in the same or different ways, nor is their interaction understood in situations where the two types are combined in a single experiment. To investigate this question we have developed a novel learning paradigm for Mongolian gerbils, which not only allows rewards and punishments to be presented in isolation or in combination with each other, but also can use these opposite reinforcers to drive the same learned behavior. Specifically, we studied learning of tone-conditioned hurdle crossing in a shuttle box driven by either an appetitive reinforcer (brain stimulation reward) or an aversive reinforcer (electrical footshock), or by a combination of both. Combination of the two reinforcers potentiated speed of acquisition, led to maximum possible performance, and delayed extinction as compared to either reinforcer alone. Additional experiments, using partial reinforcement protocols and experiments in which one of the reinforcers was omitted after the animals had been previously trained with the combination of both reinforcers, indicated that appetitive and aversive reinforcers operated together but acted in different ways: in this particular experimental context, punishment appeared to be more effective for initial acquisition and reward more effective to maintain a high level of conditioned responses (CRs). The results imply that learning mechanisms in problem solving were maximally effective when the initial punishment of mistakes was combined with the subsequent rewarding of correct performance. Copyright 2010 IBRO. Published by Elsevier Ltd. All rights reserved.

  1. Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening

    OpenAIRE

    He, Frank S.; Liu, Yang; Schwing, Alexander G.; Peng, Jian

    2016-01-01

    We propose a novel training algorithm for reinforcement learning which combines the strength of deep Q-learning with a constrained optimization approach to tighten optimality and encourage faster reward propagation. Our novel technique makes deep reinforcement learning more practical by drastically reducing the training time. We evaluate the performance of our approach on the 49 games of the challenging Arcade Learning Environment, and report significant improvements in both training time and...

  2. Manifold Regularized Reinforcement Learning.

    Science.gov (United States)

    Li, Hongliang; Liu, Derong; Wang, Ding

    2018-04-01

    This paper introduces a novel manifold regularized reinforcement learning scheme for continuous Markov decision processes. Smooth feature representations for value function approximation can be automatically learned using the unsupervised manifold regularization method. The learned features are data-driven, and can be adapted to the geometry of the state space. Furthermore, the scheme provides a direct basis representation extension for novel samples during policy learning and control. The performance of the proposed scheme is evaluated on two benchmark control tasks, i.e., the inverted pendulum and the energy storage problem. Simulation results illustrate the concepts of the proposed scheme and show that it can obtain excellent performance.

  3. Online Pedagogical Tutorial Tactics Optimization Using Genetic-Based Reinforcement Learning.

    Science.gov (United States)

    Lin, Hsuan-Ta; Lee, Po-Ming; Hsiao, Tzu-Chien

    2015-01-01

    Tutorial tactics are policies for an Intelligent Tutoring System (ITS) to decide the next action when there are multiple actions available. Recent research has demonstrated that when the learning contents were controlled so as to be the same, different tutorial tactics would make a difference in students' learning gains. However, the Reinforcement Learning (RL) techniques that were used in previous studies to induce tutorial tactics are insufficient when encountering large problems and hence were used in an offline manner. Therefore, we introduced a Genetic-Based Reinforcement Learning (GBML) approach to induce tutorial tactics in an online-learning manner without relying on any preexisting dataset. The introduced method can learn a set of rules from the environment in a manner similar to RL. It includes a genetic-based optimizer for the rule discovery task, generating new rules from old ones. This increases the scalability of an RL learner for larger problems. The results support our hypothesis about the capability of the GBML method to induce tutorial tactics. This suggests that the GBML method should be favorable for developing real-world ITS applications in the domain of tutorial tactics induction.

  4. Reinforcement learning or active inference?

    Science.gov (United States)

    Friston, Karl J; Daunizeau, Jean; Kiebel, Stefan J

    2009-07-29

    This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  5. Reinforcement learning or active inference?

    Directory of Open Access Journals (Sweden)

    Karl J Friston

    2009-07-01

    Full Text Available This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain.

  6. Decentralized Reinforcement Learning of robot behaviors

    NARCIS (Netherlands)

    Leottau, David L.; Ruiz-del-Solar, Javier; Babuska, R.

    2018-01-01

    A multi-agent methodology is proposed for Decentralized Reinforcement Learning (DRL) of individual behaviors in problems where multi-dimensional action spaces are involved. When using this methodology, sub-tasks are learned in parallel by individual agents working toward a common goal. In

  7. Continuous residual reinforcement learning for traffic signal control optimization

    NARCIS (Netherlands)

    Aslani, Mohammad; Seipel, Stefan; Wiering, Marco

    2018-01-01

    Traffic signal control can be naturally regarded as a reinforcement learning problem. Unfortunately, it is one of the most difficult classes of reinforcement learning problems owing to its large state space. A straightforward approach to address this challenge is to control traffic signals based on

  8. Reinforcement Learning for Ramp Control: An Analysis of Learning Parameters

    Directory of Open Access Journals (Sweden)

    Chao Lu

    2016-08-01

    Full Text Available Reinforcement Learning (RL) has been proposed to deal with ramp control problems under dynamic traffic conditions; however, there is a lack of sufficient research on the behaviour and impacts of different learning parameters. This paper describes a ramp control agent based on the RL mechanism and thoroughly analyzes the influence of three learning parameters, namely the learning rate, the discount rate and the action selection parameter, on algorithm performance. Two indices for learning speed and convergence stability were used to measure algorithm performance, based on which a series of simulation-based experiments were designed and conducted using a macroscopic traffic flow model. Simulation results showed that, compared with the discount rate, the learning rate and action selection parameter had more remarkable impacts on algorithm performance. Based on the analysis, some suggestions about how to select suitable parameter values to achieve superior performance were provided.
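
    A minimal tabular Q-learning sketch showing where the three analyzed parameters enter: the learning rate (alpha), the discount rate (gamma), and the action-selection parameter (here epsilon for epsilon-greedy selection). The ramp-metering state and action encodings are not given in the record, so the interface below is an assumed placeholder.

        import random
        from collections import defaultdict

        q = defaultdict(float)  # Q-table over (state, action) pairs

        def epsilon_greedy(state, actions, epsilon=0.1):
            """The action-selection parameter epsilon balances exploration and exploitation."""
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: q[(state, a)])

        def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
            """The learning rate alpha and discount rate gamma weight the one-step TD target."""
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])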

  9. Sex differences in verbal and nonverbal learning before and after temporal lobe epilepsy surgery.

    Science.gov (United States)

    Berger, Justus; Oltmanns, Frank; Holtkamp, Martin; Bengner, Thomas

    2017-01-01

    Women outperform men in a host of episodic memory tasks, yet the neuroanatomical basis for this effect is unclear. It has been suggested that the anterior temporal lobe might be especially relevant for sex differences in memory. In the current study, we investigated whether temporal lobe epilepsy (TLE) has an influence on sex effects in learning and memory and whether women and men with TLE differ in their risk for memory deficits after epilepsy surgery. 177 patients (53 women and 41 men with left TLE, 42 women and 41 men with right TLE) were neuropsychologically tested before and one year after temporal lobe resection. We found that women with TLE had better verbal, but not figural, memory than men with TLE. The female advantage in verbal memory was not affected by temporal lobe resection. The same pattern of results was found in a more homogeneous subsample of 84 patients with only hippocampal sclerosis who were seizure-free after surgery. Our findings challenge the concept that the anterior temporal lobe plays a central role in the verbal memory advantage for women. Copyright © 2016 Elsevier Inc. All rights reserved.

  10. Further tests of the Scalar Expectancy Theory (SET) and the Learning-to-Time (LeT) model in a temporal bisection task.

    Science.gov (United States)

    Machado, Armando; Arantes, Joana

    2006-06-01

    To contrast two models of timing, Scalar Expectancy Theory (SET) and Learning to Time (LeT), pigeons were exposed to a double temporal bisection procedure. On half of the trials, they learned to choose a red key after a 1-s signal and a green key after a 4-s signal; on the other half of the trials, they learned to choose a blue key after a 4-s signal and a yellow key after a 16-s signal. This was Phase A of an ABA design. In Phase B, the pigeons were divided into two groups and exposed to a new bisection task in which the signals ranged from 1 to 16 s and the choice keys were blue and green. One group was reinforced for choosing blue after 1-s signals and green after 16-s signals and the other group was reinforced for the opposite mapping (green after 1-s signals and blue after 16-s signals). Whereas SET predicted no differences between the groups, LeT predicted that the former group would learn the new discrimination faster than the latter group. The results were consistent with LeT. Finally, the pigeons returned to Phase A. Only LeT made specific predictions regarding the reacquisition of the four temporal discriminations. These predictions were only partly consistent with the results.

  11. Human demonstrations for fast and safe exploration in reinforcement learning

    NARCIS (Netherlands)

    Schonebaum, G.K.; Junell, J.L.; van Kampen, E.

    2017-01-01

    Reinforcement learning is a promising framework for controlling complex vehicles with a high level of autonomy, since it does not need a dynamic model of the vehicle, and it is able to adapt to changing conditions. When learning from scratch, the performance of a reinforcement learning controller

  12. Reinforcement Learning in Continuous Action Spaces

    NARCIS (Netherlands)

    Hasselt, H. van; Wiering, M.A.

    2007-01-01

    Considerable research has been done on Reinforcement Learning in continuous environments, but research on problems where the actions can also be chosen from a continuous space is much more limited. We present a new class of algorithms named Continuous Actor Critic Learning Automaton (CACLA)

  13. Pleasurable music affects reinforcement learning according to the listener

    Science.gov (United States)

    Gold, Benjamin P.; Frank, Michael J.; Bogert, Brigitte; Brattico, Elvira

    2013-01-01

    Mounting evidence links the enjoyment of music to brain areas implicated in emotion and the dopaminergic reward system. In particular, dopamine release in the ventral striatum seems to play a major role in the rewarding aspect of music listening. Striatal dopamine also influences reinforcement learning, such that subjects with greater dopamine efficacy learn better to approach rewards while those with lesser dopamine efficacy learn better to avoid punishments. In this study, we explored the practical implications of musical pleasure through its ability to facilitate reinforcement learning via non-pharmacological dopamine elicitation. Subjects from a wide variety of musical backgrounds chose a pleasurable and a neutral piece of music from an experimenter-compiled database, and then listened to one or both of these pieces (according to pseudo-random group assignment) as they performed a reinforcement learning task dependent on dopamine transmission. We assessed musical backgrounds as well as typical listening patterns with the new Helsinki Inventory of Music and Affective Behaviors (HIMAB), and separately investigated behavior for the training and test phases of the learning task. Subjects with more musical experience trained better with neutral music and tested better with pleasurable music, while those with less musical experience exhibited the opposite effect. HIMAB results regarding listening behaviors and subjective music ratings indicate that these effects arose from different listening styles: namely, more affective listening in non-musicians and more analytical listening in musicians. In conclusion, musical pleasure was able to influence task performance, and the shape of this effect depended on group and individual factors. These findings have implications in affective neuroscience, neuroaesthetics, learning, and music therapy. PMID:23970875

  14. Exploring the spatio-temporal neural basis of face learning

    Science.gov (United States)

    Yang, Ying; Xu, Yang; Jew, Carol A.; Pyles, John A.; Kass, Robert E.; Tarr, Michael J.

    2017-01-01

    Humans are experts at face individuation. Although previous work has identified a network of face-sensitive regions and some of the temporal signatures of face processing, as yet, we do not have a clear understanding of how such face-sensitive regions support learning at different time points. To study the joint spatio-temporal neural basis of face learning, we trained subjects to categorize two groups of novel faces and recorded their neural responses using magnetoencephalography (MEG) throughout learning. A regression analysis of neural responses in face-sensitive regions against behavioral learning curves revealed significant correlations with learning in the majority of the face-sensitive regions in the face network, mostly between 150–250 ms, but also after 300 ms. However, the effect was smaller in nonventral regions (within the superior temporal areas and prefrontal cortex) than that in the ventral regions (within the inferior occipital gyri (IOG), midfusiform gyri (mFUS) and anterior temporal lobes). A multivariate discriminant analysis also revealed that IOG and mFUS, which showed strong correlation effects with learning, exhibited significant discriminability between the two face categories at different time points both between 150–250 ms and after 300 ms. In contrast, the nonventral face-sensitive regions, where correlation effects with learning were smaller, did exhibit some significant discriminability, but mainly after 300 ms. In sum, our findings indicate that early and recurring temporal components arising from ventral face-sensitive regions are critically involved in learning new faces. PMID:28570739

  15. A Review of the Relationship between Novelty, Intrinsic Motivation and Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Siddique Nazmul

    2017-11-01

    Full Text Available This paper presents a review on the tri-partite relationship between novelty, intrinsic motivation and reinforcement learning. The paper first presents a literature survey on novelty and the different computational models of novelty detection, with a specific focus on the features of stimuli that trigger a Hedonic value for generating a novelty signal. It then presents an overview of intrinsic motivation and investigations into different models with the aim of exploring deeper co-relationships between specific features of a novelty signal and its effect on intrinsic motivation in producing a reward function. Finally, it presents survey results on reinforcement learning, different models and their functional relationship with intrinsic motivation.

  16. Human reinforcement learning subdivides structured action spaces by learning effector-specific values.

    Science.gov (United States)

    Gershman, Samuel J; Pesaran, Bijan; Daw, Nathaniel D

    2009-10-28

    Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable because of the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning-such as prediction error signals for action valuation associated with dopamine and the striatum-can cope with this "curse of dimensionality." We propose a reinforcement learning framework that allows for learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and blood oxygen level-dependent (BOLD) activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector, rather than a traditional model that treated the bimanual actions as unitary with a single value. A decomposition of value into effector-specific components was also observed in value-related BOLD signaling, in the form of lateralized biases in striatal correlates of prediction error and anticipatory value correlates in the intraparietal sulcus. These results suggest that the human brain can use decomposed value representations to "divide and conquer" reinforcement learning over high-dimensional action spaces.
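
    A minimal sketch contrasting the two value representations compared above: a unitary value for each joint bimanual action versus effector-specific values updated from separate prediction errors. Array sizes, the learning rate, and the reward split are illustrative assumptions, not the paper's fitted model.

        import numpy as np

        n_left, n_right, alpha = 4, 4, 0.1
        q_unitary = np.zeros((n_left, n_right))                  # one value per joint (left, right) action
        q_left, q_right = np.zeros(n_left), np.zeros(n_right)    # effector-specific values

        def update_unitary(l, r, reward_total):
            """Traditional model: the bimanual action has a single value and a single prediction error."""
            q_unitary[l, r] += alpha * (reward_total - q_unitary[l, r])

        def update_decomposed(l, r, reward_left, reward_right):
            """Decomposed model: separate prediction errors per hand, as in the effector-specific account."""
            q_left[l] += alpha * (reward_left - q_left[l])
            q_right[r] += alpha * (reward_right - q_right[r])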

  17. Evolutionary computation for reinforcement learning

    NARCIS (Netherlands)

    Whiteson, S.; Wiering, M.; van Otterlo, M.

    2012-01-01

    Algorithms for evolutionary computation, which simulate the process of natural selection to solve optimization problems, are an effective tool for discovering high-performing reinforcement-learning policies. Because they can automatically find good representations, handle continuous action spaces,

  18. Optimizing microstimulation using a reinforcement learning framework.

    Science.gov (United States)

    Brockmeier, Austin J; Choi, John S; Distasio, Marcello M; Francis, Joseph T; Príncipe, José C

    2011-01-01

    The ability to provide sensory feedback is desired to enhance the functionality of neuroprosthetics. Somatosensory feedback provides closed-loop control to the motor system, which is lacking in feedforward neuroprosthetics. In the case of existing somatosensory function, a template of the natural response can be used as a template of the desired response elicited by electrical microstimulation. In the case of no initial training data, microstimulation parameters that produce responses close to the template must be selected in an online manner. We propose using reinforcement learning as a framework to balance the exploration of the parameter space and the continued selection of promising parameters for further stimulation. This approach avoids an explicit model of the neural response from stimulation. We explore a preliminary architecture, treating the task as a k-armed bandit, using offline data recorded for natural touch and thalamic microstimulation, and we examine the method's efficiency in exploring the parameter space while concentrating on promising parameter forms. The best-matching stimulation parameters, from k = 68 different forms, are selected by the reinforcement learning algorithm consistently after 334 realizations.
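
    A minimal k-armed bandit sketch of the framing described above: each stimulation parameter form is an arm, and the reward is some similarity score between the elicited response and the natural-touch template. The epsilon-greedy rule and the incremental value estimate are assumptions; the record does not specify the selection rule used.

        import random

        values, counts = [0.0] * 68, [0] * 68   # k = 68 stimulation parameter forms, as in the record

        def choose_arm(epsilon=0.1):
            """Balance exploring untested stimulation parameters with exploiting promising ones."""
            if random.random() < epsilon:
                return random.randrange(len(values))
            return max(range(len(values)), key=lambda k: values[k])

        def update_arm(arm, reward):
            """Incremental sample-average estimate of each arm's template-similarity reward."""
            counts[arm] += 1
            values[arm] += (reward - values[arm]) / counts[arm]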

  19. Insights in reinforcement learning: formal analysis and empirical evaluation of temporal-difference learning algorithms

    NARCIS (Netherlands)

    van Hasselt, H.P.

    2011-01-01

    A key aspect of artificial intelligence is the ability to learn from experience. If examples of correct solutions exist, supervised learning techniques can be used to predict what the correct solution will be for future observations. However, often such examples are not readily available. The field

  20. Reinforcement learning for optimal control of low exergy buildings

    International Nuclear Information System (INIS)

    Yang, Lei; Nagy, Zoltan; Goffin, Philippe; Schlueter, Arno

    2015-01-01

    Highlights: • Implementation of reinforcement learning control for LowEx Building systems. • Learning allows adaptation to local environment without prior knowledge. • Presentation of reinforcement learning control for real-life applications. • Discussion of the applicability for real-life situations. - Abstract: Over a third of the anthropogenic greenhouse gas (GHG) emissions stem from cooling and heating buildings, due to their fossil fuel based operation. Low exergy building systems are a promising approach to reduce energy consumption as well as GHG emissions. They consist of renewable energy technologies, such as PV, PV/T and heat pumps. Since careful tuning of parameters is required, a manual setup may result in sub-optimal operation. A model predictive control approach is unnecessarily complex due to the required model identification. Therefore, in this work we present a reinforcement learning control (RLC) approach. The studied building consists of a PV/T array for solar heat and electricity generation, as well as geothermal heat pumps. We present RLC for the PV/T array, and the full building model. Two methods, Tabular Q-learning and Batch Q-learning with Memory Replay, are implemented with real building settings and actual weather conditions in a Matlab/Simulink framework. The performance is evaluated against standard rule-based control (RBC). We investigated different neural network structures and found that some outperformed RBC already during the learning phase. Overall, every RLC strategy for PV/T outperformed RBC by over 10% after the third year. Likewise, for the full building, RLC outperforms RBC in terms of meeting the heating demand, maintaining the optimal operation temperature and compensating more effectively for ground heat. This makes it possible to reduce engineering costs associated with the setup of these systems, as well as to decrease the return-on-investment period, both of which are necessary to create a sustainable, zero-emission building
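
    A minimal sketch of the two learners named above, assuming discretized building states and actions: a tabular Q-learning step and a batch update that replays stored transitions (memory replay). The state, action, and reward encodings are placeholders, not the paper's Matlab/Simulink implementation.

        import random
        from collections import defaultdict

        q, memory = defaultdict(float), []   # Q-table and replay buffer of (s, a, r, s_next) tuples

        def tabular_step(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
            """Plain tabular Q-learning update."""
            best = max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

        def batch_replay(actions, batch_size=32, alpha=0.1, gamma=0.95):
            """Re-use stored experience to extract more learning from scarce building data."""
            for s, a, r, s_next in random.sample(memory, min(batch_size, len(memory))):
                tabular_step(s, a, r, s_next, actions, alpha, gamma)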

  1. Human reinforcement learning subdivides structured action spaces by learning effector-specific values

    OpenAIRE

    Gershman, Samuel J.; Pesaran, Bijan; Daw, Nathaniel D.

    2009-01-01

    Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable, due to the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning – such as prediction error signals for action valuation associated with dopamine and the striatum – can cope with this “curse of dimensionality...

  2. Reinforcement Learning Based Novel Adaptive Learning Framework for Smart Grid Prediction

    Directory of Open Access Journals (Sweden)

    Tian Li

    2017-01-01

    Full Text Available The smart grid is a potential infrastructure for supplying electricity demand to end users in a safe and reliable manner. With the rapid increase in the share of renewable energy and controllable loads in the smart grid, its operational uncertainty has increased rapidly in recent years. Forecasting is essential for the safe and economical operation of the smart grid. However, most existing forecasting methods cannot account for the smart grid because of their inability to adapt to varying operational conditions. In this paper, reinforcement learning is first exploited to develop an online learning framework for the smart grid. With its capability of multi-time-scale resolution, a wavelet neural network is adopted in the online learning framework to yield a reinforcement learning and wavelet neural network (RLWNN) based adaptive learning scheme. Simulations on two typical prediction problems in the smart grid, wind power prediction and load forecasting, validate the effectiveness and scalability of the proposed RLWNN-based learning framework and algorithm.

  3. Activity in the superior temporal sulcus highlights learning competence in an interaction game.

    Science.gov (United States)

    Haruno, Masahiko; Kawato, Mitsuo

    2009-04-08

    During behavioral adaptation through interaction with human and nonhuman agents, marked individual differences are seen in both real-life situations and games. However, the underlying neural mechanism is not well understood. We conducted a neuroimaging experiment in which subjects maximized monetary rewards by learning in a prisoner's dilemma game with two computer agents: agent A, a tit-for-tat player who repeats the subject's previous action, and agent B, a simple stochastic cooperator oblivious to the subject's action. Approximately 1/3 of the subjects (group I) learned optimally in relation to both A and B, while another 1/3 (group II) did so only for B. Post-experiment interviews indicated that group I exploited the agent strategies more often than group II. Significant differences in learning-related brain activity between the two groups were only found in the superior temporal sulcus (STS) for both A and B. Furthermore, the learning performance of each group I subject was predictable based on this STS activity, but not in the group II subjects. This differential activity could not be attributed to a behavioral difference since it persisted in relation to agent B for which the two groups behaved similarly. In sharp contrast, the brain structures for reward processing were recruited similarly by both groups. These results suggest that STS provides knowledge of the other agent's strategies for association between action and reward and highlights learning competence during interactive reinforcement learning.

  4. Pragmatically Framed Cross-Situational Noun Learning Using Computational Reinforcement Models.

    Science.gov (United States)

    Najnin, Shamima; Banerjee, Bonny

    2018-01-01

    Cross-situational learning and social pragmatic theories are prominent mechanisms for learning word meanings (i.e., word-object pairs). In this paper, the role of reinforcement is investigated for early word-learning by an artificial agent. When exposed to a group of speakers, the agent comes to understand an initial set of vocabulary items belonging to the language used by the group. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in caregiver's speech are considered. During agent-caregiver interaction, the agent selects a word from the caregiver's utterance and learns the relations between that word and the objects in its visual environment. The "novel words to novel objects" language-specific constraint is assumed for computing rewards. The models are learned by maximizing the expected reward using reinforcement learning algorithms [i.e., table-based algorithms: Q-learning, SARSA, SARSA-λ, and neural network-based algorithms: Q-learning for neural network (Q-NN), neural-fitted Q-network (NFQ), and deep Q-network (DQN)]. Neural network-based reinforcement learning models are chosen over table-based models for better generalization and quicker convergence. Simulations are carried out using mother-infant interaction CHILDES dataset for learning word-object pairings. Reinforcement is modeled in two cross-situational learning cases: (1) with joint attention (Attentional models), and (2) with joint attention and prosodic cues (Attentional-prosodic models). Attentional-prosodic models manifest superior performance to Attentional ones for the task of word-learning. The Attentional-prosodic DQN outperforms existing word-learning models for the same task.
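
    A minimal table-based sketch of the cross-situational association framing described above: the agent keeps a value for every word-object pair and nudges it toward a reward that encodes the "novel words to novel objects" constraint. The reward scheme, learning rate, and bookkeeping are illustrative assumptions, not the paper's Q-learning, SARSA, or DQN models.

        from collections import defaultdict

        value = defaultdict(float)              # value[(word, obj)] association strengths
        known_words, known_objects = set(), set()

        def pairing_reward(word, obj):
            """Assumed reward: favour pairing a novel word with a novel object."""
            novel_pairing = word not in known_words and obj not in known_objects
            return 1.0 if novel_pairing else -0.1

        def learn_pair(word, obj, alpha=0.2):
            value[(word, obj)] += alpha * (pairing_reward(word, obj) - value[(word, obj)])
            known_words.add(word)
            known_objects.add(obj)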

  5. Learning Similar Actions by Reinforcement or Sensory-Prediction Errors Rely on Distinct Physiological Mechanisms.

    Science.gov (United States)

    Uehara, Shintaro; Mawase, Firas; Celnik, Pablo

    2017-09-14

    Humans can acquire knowledge of new motor behavior via different forms of learning. The two forms most commonly studied have been the development of internal models based on sensory-prediction errors (error-based learning) and success-based feedback (reinforcement learning). Human behavioral studies suggest these are distinct learning processes, though the neurophysiological mechanisms that are involved have not been characterized. Here, we evaluated physiological markers from the cerebellum and the primary motor cortex (M1) using noninvasive brain stimulations while healthy participants trained finger-reaching tasks. We manipulated the extent to which subjects rely on error-based or reinforcement by providing either vector or binary feedback about task performance. Our results demonstrated a double dissociation where learning the task mainly via error-based mechanisms leads to cerebellar plasticity modifications but not long-term potentiation (LTP)-like plasticity changes in M1; while learning a similar action via reinforcement mechanisms elicited M1 LTP-like plasticity but not cerebellar plasticity changes. Our findings indicate that learning complex motor behavior is mediated by the interplay of different forms of learning, weighing distinct neural mechanisms in M1 and the cerebellum. Our study provides insights for designing effective interventions to enhance human motor learning. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  6. Universal effect of dynamical reinforcement learning mechanism in spatial evolutionary games

    International Nuclear Information System (INIS)

    Zhang, Hai-Feng; Wu, Zhi-Xi; Wang, Bing-Hong

    2012-01-01

    One of the prototypical mechanisms in understanding the ubiquitous cooperation in social dilemma situations is the win-stay, lose-shift rule. In this work, a generalized win-stay, lose-shift learning model (a reinforcement learning model with a dynamic aspiration level) is proposed to describe how humans adapt their social behaviors based on their social experiences. In the model, the players incorporate information about the outcomes of previous rounds with time-dependent aspiration payoffs to regulate the probability of choosing cooperation. By investigating such a reinforcement learning rule in the spatial prisoner's dilemma game and the public goods game, the most noteworthy finding is that moderate greediness (i.e., a moderate aspiration level) best favors the development and organization of collective cooperation. The generality of this observation is tested against different regulation strengths and different types of interaction network as well. We also make comparisons with two recently proposed models to highlight the importance of the mechanism of adaptive aspiration level in supporting cooperation in structured populations
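
    A minimal sketch of the generalized win-stay, lose-shift rule described above: satisfaction relative to a moving aspiration level modulates the cooperation propensity, and the aspiration itself habituates toward recent payoffs. The habituation and regulation rates are illustrative assumptions.

        def wsls_dynamic_aspiration(p_coop, action, payoff, aspiration, habituation=0.2, regulation=0.3):
            """One round of reinforcement learning with a time-dependent aspiration level."""
            satisfied = payoff >= aspiration
            stay = 1.0 if (action == "C") == satisfied else 0.0        # win-stay, lose-shift tendency
            p_coop += regulation * (stay - p_coop)                     # regulate cooperation propensity
            aspiration += habituation * (payoff - aspiration)          # aspiration tracks recent payoffs
            return p_coop, aspiration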

  7. Optimal Bidding and Operation of a Power Plant with Solvent-Based Carbon Capture under a CO2 Allowance Market: A Solution with a Reinforcement Learning-Based Sarsa Temporal-Difference Algorithm

    Directory of Open Access Journals (Sweden)

    Ziang Li

    2017-04-01

    Full Text Available In this paper, a reinforcement learning (RL)-based Sarsa temporal-difference (TD) algorithm is applied to search for a unified bidding and operation strategy for a coal-fired power plant with monoethanolamine (MEA)-based post-combustion carbon capture under different carbon dioxide (CO2) allowance market conditions. The objective of the decision maker for the power plant is to maximize the discounted cumulative profit during the power plant lifetime. Two constraints are considered for the objective formulation. Firstly, the tradeoff between the energy-intensive carbon capture and the electricity generation should be made under presumed fixed fuel consumption. Secondly, the CO2 allowances purchased from the CO2 allowance market should be approximately equal to the quantity of CO2 emission from power generation. Three case studies are demonstrated thereafter. In the first case, we show the convergence of the Sarsa TD algorithm and find a deterministic optimal bidding and operation strategy. In the second case, compared with the independently designed operation and bidding strategies discussed in most of the relevant literature, the Sarsa TD-based unified bidding and operation strategy with time-varying flexible market-oriented CO2 capture levels is demonstrated to help the power plant decision maker gain a higher discounted cumulative profit. In the third case, a competitor operating another power plant identical to the preceding plant is considered under the same CO2 allowance market. The competitor also has carbon capture facilities but applies a different strategy to earn profits. The discounted cumulative profits of the two power plants are then compared, thus exhibiting the competitiveness of the power plant that is using the unified bidding and operation strategy explored by the Sarsa TD algorithm.
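
    A minimal on-policy Sarsa TD sketch of the update rule named in the record. The state (market and plant condition), action (bid plus CO2 capture level), and reward (periodic profit) encodings are assumptions; only the Sarsa update itself is standard.

        import random
        from collections import defaultdict

        q = defaultdict(float)   # Q-table over (state, action) pairs

        def select_action(s, actions, epsilon=0.1):
            """Epsilon-greedy choice over the discretized bidding/operation actions."""
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda a: q[(s, a)])

        def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
            """Sarsa bootstraps from the action actually selected next (on-policy TD)."""
            q[(s, a)] += alpha * (r + gamma * q[(s_next, a_next)] - q[(s, a)])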

  8. Reinforcement Learning Based Artificial Immune Classifier

    Directory of Open Access Journals (Sweden)

    Mehmet Karakose

    2013-01-01

    Full Text Available Artificial immune systems are among the widely used methods for classification, which is a decision-making process. Artificial immune systems, based on the natural immune system, can be successfully applied to classification, optimization, recognition, and learning in real-world problems. In this study, a reinforcement learning based artificial immune classifier is proposed as a new approach. This approach uses reinforcement learning to find better antibodies with immune operators. The proposed approach offers several advantages over other methods in the literature, such as effectiveness, fewer memory cells, high accuracy, speed, and data adaptability. The performance of the proposed approach is demonstrated by simulation and experimental results using real data in Matlab and on an FPGA. Some benchmark data and remote image data are used for the experimental results. Comparative results with a supervised/unsupervised artificial immune system, a negative selection classifier, and a resource-limited artificial immune classifier are given to demonstrate the effectiveness of the proposed new method.

  9. Temporal difference learning for the game Tic-Tac-Toe 3D : applying structure to neural networks

    NARCIS (Netherlands)

    van de Steeg, M.; Drugan, M.M.; Wiering, M.

    2015-01-01

    When reinforcement learning is applied to large state spaces, such as those occurring in playing board games, the use of a good function approximator to learn to approximate the value function is very important. In previous research, multi-layer perceptrons have often been quite successfully used as

  10. Online reinforcement learning control for aerospace systems

    NARCIS (Netherlands)

    Zhou, Y.

    2018-01-01

    Reinforcement Learning (RL) methods are relatively new in the field of aerospace guidance, navigation, and control. This dissertation aims to exploit RL methods to improve the autonomy and online learning of aerospace systems with respect to the a priori unknown system and environment, dynamical

  11. Multi-agent machine learning a reinforcement approach

    CERN Document Server

    Schwartz, H M

    2014-01-01

    The book begins with a chapter on traditional methods of supervised learning, covering recursive least squares learning, mean square error methods, and stochastic approximation. Chapter 2 covers single agent reinforcement learning. Topics include learning value functions, Markov games, and TD learning with eligibility traces. Chapter 3 discusses two player games including two player matrix games with both pure and mixed strategies. Numerous algorithms and examples are presented. Chapter 4 covers learning in multi-player games, stochastic games, and Markov games, focusing on learning multi-pla

  12. Structure identification in fuzzy inference using reinforcement learning

    Science.gov (United States)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we have shown that the system can start with the surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to back up a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both the surface and the deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.

  13. Reinforcement Learning Based on the Bayesian Theorem for Electricity Markets Decision Support

    DEFF Research Database (Denmark)

    Sousa, Tiago; Pinto, Tiago; Praca, Isabel

    2014-01-01

    This paper presents the applicability of a reinforcement learning algorithm based on the application of the Bayesian theorem of probability. The proposed reinforcement learning algorithm is an advantageous and indispensable tool for ALBidS (Adaptive Learning strategic Bidding System), a multi...

  14. Using a board game to reinforce learning.

    Science.gov (United States)

    Yoon, Bona; Rodriguez, Leslie; Faselis, Charles J; Liappis, Angelike P

    2014-03-01

    Experiential gaming strategies offer a variation on traditional learning. A board game was used to present synthesized content of fundamental catheter care concepts and reinforce evidence-based practices relevant to nursing. Board games are innovative educational tools that can enhance active learning. Copyright 2014, SLACK Incorporated.

  15. "Notice of Violation of IEEE Publication Principles" Multiobjective Reinforcement Learning: A Comprehensive Overview.

    Science.gov (United States)

    Liu, Chunming; Xu, Xin; Hu, Dewen

    2013-04-29

    Reinforcement learning is a powerful mechanism for enabling agents to learn in an unknown environment, and most reinforcement learning algorithms aim to maximize some numerical value, which represents only one long-term objective. However, multiple long-term objectives are exhibited in many real-world decision and control problems; therefore, recently, there has been growing interest in solving multiobjective reinforcement learning (MORL) problems with multiple conflicting objectives. The aim of this paper is to present a comprehensive overview of MORL. In this paper, the basic architecture, research topics, and naive solutions of MORL are introduced at first. Then, several representative MORL approaches and some important directions of recent research are reviewed. The relationships between MORL and other related research are also discussed, which include multiobjective optimization, hierarchical reinforcement learning, and multi-agent reinforcement learning. Finally, research challenges and open problems of MORL techniques are highlighted.
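
    As a concrete instance of the naive scalarization approach such overviews discuss, the sketch below runs Q-learning on a weighted sum of objectives. The environment interface and preference weights are assumptions for illustration, not a method proposed in the paper.

```python
# Hedged sketch: linear-scalarization multi-objective Q-learning. The env is
# assumed to return a reward vector with one entry per objective.
import numpy as np

def scalarized_q_learning(env, n_states, n_actions, weights,
                          episodes=500, alpha=0.1, gamma=0.95, eps=0.1):
    """weights: fixed preference vector over the reward objectives."""
    n_obj = len(weights)
    Q = np.zeros((n_states, n_actions, n_obj))   # one value vector per state-action
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy on the scalarized value w^T Q(s, a)
            a = (rng.integers(n_actions) if rng.random() < eps
                 else int(np.argmax(Q[s] @ weights)))
            s_next, reward_vec, done = env.step(a)        # reward_vec: one entry per objective
            a_next = int(np.argmax(Q[s_next] @ weights))  # greedy w.r.t. scalarized value
            target = np.asarray(reward_vec, dtype=float)
            if not done:
                target = target + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```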

  16. Exploiting Best-Match Equations for Efficient Reinforcement Learning

    NARCIS (Netherlands)

    van Seijen, Harm; Whiteson, Shimon; van Hasselt, Hado; Wiering, Marco

    This article presents and evaluates best-match learning, a new approach to reinforcement learning that trades off the sample efficiency of model-based methods with the space efficiency of model-free methods. Best-match learning works by approximating the solution to a set of best-match equations,

  17. Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing

    OpenAIRE

    Le, Minh; Fokkens, Antske

    2017-01-01

    Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error propagation. Reinforcement learning improves accuracy of both labeled and unlabeled dependencies of the Stanford Neural Dependency Parser, a high performance greedy parser, while maintaining its eff...

  18. Longitudinal investigation on learned helplessness tested under negative and positive reinforcement involving stimulus control.

    Science.gov (United States)

    Oliveira, Emileane C; Hunziker, Maria Helena

    2014-07-01

    In this study, we investigated whether (a) animals demonstrating the learned helplessness effect during an escape contingency also show learning deficits under positive reinforcement contingencies involving stimulus control and (b) the exposure to positive reinforcement contingencies eliminates the learned helplessness effect under an escape contingency. Rats were initially exposed to controllable (C), uncontrollable (U) or no (N) shocks. After 24 h, they were exposed to 60 escapable shocks delivered in a shuttlebox. In the following phase, we selected from each group the four subjects that presented the most typical group pattern: no escape learning (learned helplessness effect) in Group U and escape learning in Groups C and N. All subjects were then exposed to two phases: (1) positive reinforcement for lever pressing under a multiple FR/Extinction schedule and (2) a re-test under negative reinforcement (escape). A fourth group (n=4) was exposed only to the positive reinforcement sessions. All subjects showed discrimination learning under the multiple schedule. In the escape re-test, the learned helplessness effect was maintained for three of the animals in Group U. These results suggest that the learned helplessness effect did not extend to discriminative behavior that is positively reinforced and that the learned helplessness effect did not revert for most subjects after exposure to positive reinforcement. We discuss some theoretical implications as related to learned helplessness as an effect restricted to aversive contingencies and to the absence of reversion after positive reinforcement. Copyright © 2014. Published by Elsevier B.V.

  19. Adaptive Trajectory Tracking Control using Reinforcement Learning for Quadrotor

    Directory of Open Access Journals (Sweden)

    Wenjie Lou

    2016-02-01

    Full Text Available Inaccurate system parameters and unpredicted external disturbances affect the performance of non-linear controllers. In this paper, a new adaptive control algorithm under the reinforcement learning framework is proposed to stabilize a quadrotor helicopter. Based on a command-filtered non-linear control algorithm, adaptive elements are added and learned by policy-search methods. To predict the inaccurate system parameters, a new kernel-based regression learning method is provided. In addition, Policy learning by Weighting Exploration with the Returns (PoWER) and Return Weighted Regression (RWR) are utilized to learn the appropriate parameters for the adaptive elements in order to cancel the effect of external disturbance. Furthermore, numerical simulations under several conditions are performed, and the ability of adaptive trajectory-tracking control with reinforcement learning is demonstrated.
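
    The following is a minimal sketch of a Return Weighted Regression style parameter update of the general kind named in the abstract: perturb the adaptive-element parameters, weight each rollout by an exponentiated return, and move the mean toward the weighted average. The rollout_return callable, noise scale, and temperature are assumptions, not the paper's controller.

```python
# Hedged sketch of a return-weighted policy-search update (RWR-style).
import numpy as np

def rwr_update(theta, rollout_return, n_samples=20, sigma=0.05, beta=5.0, seed=0):
    """theta: current adaptive-element parameters (1-D array).
    rollout_return(params) -> scalar return of one rollout (assumed callable)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, size=(n_samples, theta.size))    # exploration noise
    returns = np.array([rollout_return(theta + e) for e in eps])  # evaluate each perturbation
    w = np.exp(beta * (returns - returns.max()))                  # exponentiated-return weights
    w /= w.sum()
    return theta + w @ eps                                        # return-weighted parameter step
```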

  20. Enriching behavioral ecology with reinforcement learning methods.

    Science.gov (United States)

    Frankenhuis, Willem E; Panchanathan, Karthik; Barto, Andrew G

    2018-02-13

    This article focuses on the division of labor between evolution and development in solving sequential, state-dependent decision problems. Currently, behavioral ecologists tend to use dynamic programming methods to study such problems. These methods are successful at predicting animal behavior in a variety of contexts. However, they depend on a distinct set of assumptions. Here, we argue that behavioral ecology will benefit from drawing more than it currently does on a complementary collection of tools, called reinforcement learning methods. These methods allow for the study of behavior in highly complex environments, which conventional dynamic programming methods do not feasibly address. In addition, reinforcement learning methods are well-suited to studying how biological mechanisms solve developmental and learning problems. For instance, we can use them to study simple rules that perform well in complex environments. Or to investigate under what conditions natural selection favors fixed, non-plastic traits (which do not vary across individuals), cue-driven-switch plasticity (innate instructions for adaptive behavioral development based on experience), or developmental selection (the incremental acquisition of adaptive behavior based on experience). If natural selection favors developmental selection, which includes learning from environmental feedback, we can also make predictions about the design of reward systems. Our paper is written in an accessible manner and for a broad audience, though we believe some novel insights can be drawn from our discussion. We hope our paper will help advance the emerging bridge connecting the fields of behavioral ecology and reinforcement learning. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.

  1. Reinforcement learning techniques for controlling resources in power networks

    Science.gov (United States)

    Kowli, Anupama Sunil

    As power grids transition towards increased reliance on renewable generation, energy storage and demand response resources, an effective control architecture is required to harness the full functionalities of these resources. There is a critical need for control techniques that recognize the unique characteristics of the different resources and exploit the flexibility afforded by them to provide ancillary services to the grid. The work presented in this dissertation addresses these needs. Specifically, new algorithms are proposed, which allow control synthesis in settings wherein the precise distribution of the uncertainty and its temporal statistics are not known. These algorithms are based on recent developments in Markov decision theory, approximate dynamic programming and reinforcement learning. They impose minimal assumptions on the system model and allow the control to be "learned" based on the actual dynamics of the system. Furthermore, they can accommodate complex constraints such as capacity and ramping limits on generation resources, state-of-charge constraints on storage resources, comfort-related limitations on demand response resources and power flow limits on transmission lines. Numerical studies demonstrating applications of these algorithms to practical control problems in power systems are discussed. Results demonstrate how the proposed control algorithms can be used to improve the performance and reduce the computational complexity of the economic dispatch mechanism in a power network. We argue that the proposed algorithms are eminently suitable to develop operational decision-making tools for large power grids with many resources and many sources of uncertainty.

  2. How we learn to make decisions: rapid propagation of reinforcement learning prediction errors in humans.

    Science.gov (United States)

    Krigolson, Olav E; Hassall, Cameron D; Handy, Todd C

    2014-03-01

    Our ability to make decisions is predicated upon our knowledge of the outcomes of the actions available to us. Reinforcement learning theory posits that actions followed by a reward or punishment acquire value through the computation of prediction errors-discrepancies between the predicted and the actual reward. A multitude of neuroimaging studies have demonstrated that rewards and punishments evoke neural responses that appear to reflect reinforcement learning prediction errors [e.g., Krigolson, O. E., Pierce, L. J., Holroyd, C. B., & Tanaka, J. W. Learning to become an expert: Reinforcement learning and the acquisition of perceptual expertise. Journal of Cognitive Neuroscience, 21, 1833-1840, 2009; Bayer, H. M., & Glimcher, P. W. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron, 47, 129-141, 2005; O'Doherty, J. P. Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology, 14, 769-776, 2004; Holroyd, C. B., & Coles, M. G. H. The neural basis of human error processing: Reinforcement learning, dopamine, and the error-related negativity. Psychological Review, 109, 679-709, 2002]. Here, we used the brain ERP technique to demonstrate that not only do rewards elicit a neural response akin to a prediction error but also that this signal rapidly diminished and propagated to the time of choice presentation with learning. Specifically, in a simple, learnable gambling task, we show that novel rewards elicited a feedback error-related negativity that rapidly decreased in amplitude with learning. Furthermore, we demonstrate the existence of a reward positivity at choice presentation, a previously unreported ERP component that has a similar timing and topography as the feedback error-related negativity that increased in amplitude with learning. The pattern of results we observed mirrored the output of a computational model that we implemented to compute reward
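
    A toy delta-rule simulation can illustrate the reported shift: as learning progresses, the prediction error at feedback shrinks while the value signal available at choice presentation grows. The reward probabilities and learning rate below are assumptions, not the experimental parameters.

```python
# Hedged sketch: delta-rule learner in a two-option gambling task, tracking how
# feedback prediction errors shrink and choice-time values grow with learning.
import numpy as np

rng = np.random.default_rng(0)
p_reward = [0.8, 0.2]        # assumed payoff probabilities of the two options
V = np.zeros(2)              # learned option values
alpha = 0.15
feedback_pe, choice_value = [], []

for trial in range(200):
    choice = int(np.argmax(V)) if rng.random() > 0.1 else int(rng.integers(2))
    reward = float(rng.random() < p_reward[choice])
    pe = reward - V[choice]                  # prediction error at feedback
    V[choice] += alpha * pe
    feedback_pe.append(abs(pe))
    choice_value.append(V.max())             # signal available at choice onset

print("early vs late |PE|:", np.mean(feedback_pe[:20]), np.mean(feedback_pe[-20:]))
print("early vs late value:", np.mean(choice_value[:20]), np.mean(choice_value[-20:]))
```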

  3. Flow Navigation by Smart Microswimmers via Reinforcement Learning

    Science.gov (United States)

    Colabrese, Simona; Biferale, Luca; Celani, Antonio; Gustavsson, Kristian

    2017-11-01

    We have numerically modeled active particles which are able to acquire some limited knowledge of the fluid environment from simple mechanical cues and exert a control on their preferred steering direction. We show that those swimmers can learn effective strategies just by experience, using a reinforcement learning algorithm. As an example, we focus on smart gravitactic swimmers. These are active particles whose task is to reach the highest altitude within some time horizon, exploiting the underlying flow whenever possible. The reinforcement learning algorithm allows particles to learn effective strategies even in difficult situations when, in the absence of control, they would end up being trapped by flow structures. These strategies are highly nontrivial and cannot be easily guessed in advance. This work paves the way towards the engineering of smart microswimmers that solve difficult navigation problems. ERC AdG NewTURB 339032.
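
    A minimal tabular Q-learning sketch of this kind of navigation problem is shown below: the swimmer picks a preferred steering direction from a discretized local-flow cue and is rewarded by the altitude gained. The environment interface, discretization, and reward are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: tabular Q-learning for a "smart swimmer" choosing a preferred
# steering direction from a coarse flow observation; env is an assumed interface
# exposing reset(), step(action) -> (next_state, altitude_gain), and horizon.
import numpy as np

def train_swimmer(env, n_flow_states, n_directions,
                  episodes=2000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_flow_states, n_directions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s = env.reset()                      # discretized local flow cue
        for _ in range(env.horizon):
            a = (rng.integers(n_directions) if rng.random() < eps
                 else int(np.argmax(Q[s])))
            s_next, dz = env.step(a)         # dz: altitude gained this step (the reward)
            Q[s, a] += alpha * (dz + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```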

  4. The drift diffusion model as the choice rule in reinforcement learning.

    Science.gov (United States)

    Pedersen, Mads Lund; Frank, Michael J; Biele, Guido

    2017-08-01

    Current reinforcement-learning models often assume simplified decision processes that do not fully reflect the dynamic complexities of choice processes. Conversely, sequential-sampling models of decision making account for both choice accuracy and response time, but assume that decisions are based on static decision values. To combine these two computational models of decision making and learning, we implemented reinforcement-learning models in which the drift diffusion model describes the choice process, thereby capturing both within- and across-trial dynamics. To exemplify the utility of this approach, we quantitatively fit data from a common reinforcement-learning paradigm using hierarchical Bayesian parameter estimation, and compared model variants to determine whether they could capture the effects of stimulant medication in adult patients with attention-deficit hyperactivity disorder (ADHD). The model with the best relative fit provided a good description of the learning process, choices, and response times. A parameter recovery experiment showed that the hierarchical Bayesian modeling approach enabled accurate estimation of the model parameters. The model approach described here, using simultaneous estimation of reinforcement-learning and drift diffusion model parameters, shows promise for revealing new insights into the cognitive and neural mechanisms of learning and decision making, as well as the alteration of such processes in clinical groups.
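
    The core idea can be sketched as follows: the difference in learned option values sets the drift rate of a diffusion process that produces both the choice and the response time, and the obtained reward updates the values with a delta rule. The parameter values here are assumptions, not the fitted estimates from the study.

```python
# Hedged sketch of an RL-DDM trial: value difference -> drift rate -> choice + RT,
# followed by a delta-rule value update.
import numpy as np

rng = np.random.default_rng(0)

def ddm_choice(v, a=1.0, dt=0.001, noise=1.0, t0=0.3):
    """Drift-diffusion trial: drift v, boundary separation a, non-decision time t0."""
    x, t = 0.0, 0.0
    while abs(x) < a / 2:
        x += v * dt + noise * np.sqrt(dt) * rng.normal()
        t += dt
    return (1 if x > 0 else 0), t + t0      # upper boundary corresponds to option 1

def rl_ddm_trial(Q, p_reward, alpha=0.1, scale=2.0):
    drift = scale * (Q[1] - Q[0])            # value difference drives evidence accumulation
    choice, rt = ddm_choice(drift)
    reward = float(rng.random() < p_reward[choice])
    Q[choice] += alpha * (reward - Q[choice])   # delta-rule value update
    return choice, rt
```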

  5. Embedded Incremental Feature Selection for Reinforcement Learning

    Science.gov (United States)

    2012-05-01

    Prior to this work, feature selection for reinforcement learning has focused on linear value function approximation (Kolter and Ng, 2009; Parr et al.).

  6. Flexible Heuristic Dynamic Programming for Reinforcement Learning in Quadrotors

    NARCIS (Netherlands)

    Helmer, Alexander; de Visser, C.C.; van Kampen, E.

    2018-01-01

    Reinforcement learning is a paradigm for learning decision-making tasks from interaction with the environment. Function approximators solve a part of the curse of dimensionality when learning in high-dimensional state and/or action spaces. It can be a time-consuming process to learn a good policy in

  7. Working Memory and Reinforcement Schedule Jointly Determine Reinforcement Learning in Children: Potential Implications for Behavioral Parent Training

    Directory of Open Access Journals (Sweden)

    Elien Segers

    2018-03-01

    Full Text Available Introduction: Behavioral Parent Training (BPT) is often provided for childhood psychiatric disorders. These disorders have been shown to be associated with working memory impairments. BPT is based on operant learning principles, yet how operant principles shape behavior (through the partial reinforcement (PRF) extinction effect, i.e., greater resistance to extinction that is created when behavior is reinforced partially rather than continuously) and the potential role of working memory therein is scarcely studied in children. This study explored the PRF extinction effect and the role of working memory therein using experimental tasks in typically developing children. Methods: Ninety-seven children (age 6–10) completed a working memory task and an operant learning task, in which children acquired a response-sequence rule under either continuous or PRF (120 trials), followed by an extinction phase (80 trials). Data of 88 children were used for analysis. Results: The PRF extinction effect was confirmed: We observed slower acquisition and extinction in the PRF condition as compared to the continuous reinforcement (CRF) condition. Working memory was negatively related to acquisition but not extinction performance. Conclusion: Both reinforcement contingencies and working memory relate to acquisition performance. Potential implications for BPT are that decreasing working memory load may enhance the chance of optimally learning through reinforcement.

  8. Reinforcement learning on slow features of high-dimensional input streams.

    Directory of Open Access Journals (Sweden)

    Robert Legenstein

    Full Text Available Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.

  9. Reinforcement learning: Solving two case studies

    Science.gov (United States)

    Duarte, Ana Filipa; Silva, Pedro; dos Santos, Cristina Peixoto

    2012-09-01

    Reinforcement Learning algorithms offer interesting features for the control of autonomous systems, such as the ability to learn from direct interaction with the environment, and the use of a simple reward signal, as opposed to the input-output pairs used in classic supervised learning. The reward signal indicates the success or failure of the actions executed by the agent in the environment. In this work, we describe RL algorithms applied to two case studies: the Crawler robot and the widely known inverted pendulum. We explore RL capabilities to autonomously learn a basic locomotion pattern in the Crawler, and approach the balancing problem of biped locomotion using the inverted pendulum.

  10. Systems control with generalized probabilistic fuzzy-reinforcement learning

    NARCIS (Netherlands)

    Hinojosa, J.; Nefti, S.; Kaymak, U.

    2011-01-01

    Reinforcement learning (RL) is a valuable learning method when the systems require a selection of control actions whose consequences emerge over long periods for which input-output data are not available. In most combinations of fuzzy systems and RL, the environment is considered to be

  11. Efficient abstraction selection in reinforcement learning

    NARCIS (Netherlands)

    Seijen, H. van; Whiteson, S.; Kester, L.

    2013-01-01

    This paper introduces a novel approach for abstraction selection in reinforcement learning problems modelled as factored Markov decision processes (MDPs), for which a state is described via a set of state components. In abstraction selection, an agent must choose an abstraction from a set of

  12. Temporal maps and informativeness in associative learning.

    Science.gov (United States)

    Balsam, Peter D; Gallistel, C Randy

    2009-02-01

    Neurobiological research on learning assumes that temporal contiguity is essential for association formation, but what constitutes temporal contiguity has never been specified. We review evidence that learning depends, instead, on learning a temporal map. Temporal relations between events are encoded even from single experiences. The speed with which an anticipatory response emerges is proportional to the informativeness of the encoded relation between a predictive stimulus or event and the event it predicts. This principle yields a quantitative account of the heretofore undefined, but theoretically crucial, concept of temporal pairing, an account in quantitative accord with surprising experimental findings. The same principle explains the basic results in the cue competition literature, which motivated the Rescorla-Wagner model and most other contemporary models of associative learning. The essential feature of a memory mechanism in this account is its ability to encode quantitative information.

  13. Learning alternative movement coordination patterns using reinforcement feedback.

    Science.gov (United States)

    Lin, Tzu-Hsiang; Denomme, Amber; Ranganathan, Rajiv

    2018-05-01

    One of the characteristic features of the human motor system is redundancy-i.e., the ability to achieve a given task outcome using multiple coordination patterns. However, once participants settle on using a specific coordination pattern, the process of learning to use a new alternative coordination pattern to perform the same task is still poorly understood. Here, using two experiments, we examined this process of how participants shift from one coordination pattern to another using different reinforcement schedules. Participants performed a virtual reaching task, where they moved a cursor to different targets positioned on the screen. Our goal was to make participants use a coordination pattern with greater trunk motion, and to this end, we provided reinforcement by making the cursor disappear if the trunk motion during the reach did not cross a specified threshold value. In Experiment 1, we compared two reinforcement schedules in two groups of participants-an abrupt group, where the threshold was introduced immediately at the beginning of practice; and a gradual group, where the threshold was introduced gradually with practice. Results showed that both abrupt and gradual groups were effective in shifting their coordination patterns to involve greater trunk motion, but the abrupt group showed greater retention when the reinforcement was removed. In Experiment 2, we examined the basis of this advantage in the abrupt group using two additional control groups. Results showed that the advantage of the abrupt group was because of a greater number of practice trials with the desired coordination pattern. Overall, these results show that reinforcement can be successfully used to shift coordination patterns, which has potential in the rehabilitation of movement disorders.

  14. Explicit and implicit reinforcement learning across the psychosis spectrum.

    Science.gov (United States)

    Barch, Deanna M; Carter, Cameron S; Gold, James M; Johnson, Sheri L; Kring, Ann M; MacDonald, Angus W; Pizzagalli, Diego A; Ragland, J Daniel; Silverstein, Steven M; Strauss, Milton E

    2017-07-01

    Motivational and hedonic impairments are core features of a variety of types of psychopathology. An important aspect of motivational function is reinforcement learning (RL), including implicit (i.e., outside of conscious awareness) and explicit (i.e., including explicit representations about potential reward associations) learning, as well as both positive reinforcement (learning about actions that lead to reward) and punishment (learning to avoid actions that lead to loss). Here we present data from paradigms designed to assess both positive and negative components of both implicit and explicit RL, examine performance on each of these tasks among individuals with schizophrenia, schizoaffective disorder, and bipolar disorder with psychosis, and examine their relative relationships to specific symptom domains transdiagnostically. None of the diagnostic groups differed significantly from controls on the implicit RL tasks in either bias toward a rewarded response or bias away from a punished response. However, on the explicit RL task, both the individuals with schizophrenia and schizoaffective disorder performed significantly worse than controls, but the individuals with bipolar did not. Worse performance on the explicit RL task, but not the implicit RL task, was related to worse motivation and pleasure symptoms across all diagnostic categories. Performance on explicit RL, but not implicit RL, was related to working memory, which accounted for some of the diagnostic group differences. However, working memory did not account for the relationship of explicit RL to motivation and pleasure symptoms. These findings suggest transdiagnostic relationships across the spectrum of psychotic disorders between motivation and pleasure impairments and explicit RL. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  15. Advances in Temporal Analysis in Learning and Instruction

    Science.gov (United States)

    Molenaar, Inge

    2014-01-01

    This paper focuses on a trend to analyse temporal characteristics of constructs important to learning and instruction. Different researchers have indicated that we should pay more attention to time in our research to enhance explanatory power and increase validity. Constructs formerly viewed as personal traits, such as self-regulated learning and…

  16. Integral reinforcement learning for continuous-time input-affine nonlinear systems with simultaneous invariant explorations.

    Science.gov (United States)

    Lee, Jae Young; Park, Jin Bae; Choi, Yoon Ho

    2015-05-01

    This paper focuses on a class of reinforcement learning (RL) algorithms, named integral RL (I-RL), that solve continuous-time (CT) nonlinear optimal control problems with input-affine system dynamics. First, we extend the concepts of exploration, integral temporal difference, and invariant admissibility to the target CT nonlinear system that is governed by a control policy plus a probing signal called an exploration. Then, we show input-to-state stability (ISS) and invariant admissibility of the closed-loop systems with the policies generated by the integral policy iteration (I-PI) or invariantly admissible PI (IA-PI) method. Based on these, three online I-RL algorithms named explorized I-PI and integral Q-learning I and II are proposed, all of which generate the same convergent sequences as I-PI and IA-PI under the required excitation condition on the exploration. All the proposed methods are partially or completely model free, and can simultaneously explore the state space in a stable manner during the online learning processes. ISS, invariant admissibility, and convergence properties of the proposed methods are also investigated, and, related to these, we show the design principles of the exploration for safe learning. Neural-network-based implementation methods for the proposed schemes are also presented in this paper. Finally, several numerical simulations are carried out to verify the effectiveness of the proposed methods.
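
    For reference, the standard integral Bellman relation and the integral temporal-difference error that underlie integral RL can be written as below; the paper's exploration-corrected variants build on this basic form.

```latex
% Integral Bellman relation over a sample interval T and the corresponding
% integral temporal-difference (ITD) error used in policy evaluation.
V\bigl(x(t)\bigr) = \int_{t}^{t+T} r\bigl(x(\tau), u(\tau)\bigr)\, d\tau + V\bigl(x(t+T)\bigr),
\qquad
e_{\mathrm{ITD}} = \int_{t}^{t+T} r\bigl(x(\tau), u(\tau)\bigr)\, d\tau + V\bigl(x(t+T)\bigr) - V\bigl(x(t)\bigr).
```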

  17. Cardiac Concomitants of Feedback and Prediction Error Processing in Reinforcement Learning

    Science.gov (United States)

    Kastner, Lucas; Kube, Jana; Villringer, Arno; Neumann, Jane

    2017-01-01

    Successful learning hinges on the evaluation of positive and negative feedback. We assessed differential learning from reward and punishment in a monetary reinforcement learning paradigm, together with cardiac concomitants of positive and negative feedback processing. On the behavioral level, learning from reward resulted in more advantageous behavior than learning from punishment, suggesting a differential impact of reward and punishment on successful feedback-based learning. On the autonomic level, learning and feedback processing were closely mirrored by phasic cardiac responses on a trial-by-trial basis: (1) Negative feedback was accompanied by faster and prolonged heart rate deceleration compared to positive feedback. (2) Cardiac responses shifted from feedback presentation at the beginning of learning to stimulus presentation later on. (3) Most importantly, the strength of phasic cardiac responses to the presentation of feedback correlated with the strength of prediction error signals that alert the learner to the necessity for behavioral adaptation. Considering participants' weight status and gender revealed obesity-related deficits in learning to avoid negative consequences and less consistent behavioral adaptation in women compared to men. In sum, our results provide strong new evidence for the notion that during learning phasic cardiac responses reflect an internal value and feedback monitoring system that is sensitive to the violation of performance-based expectations. Moreover, inter-individual differences in weight status and gender may affect both behavioral and autonomic responses in reinforcement-based learning. PMID:29163004

  18. Cardiac Concomitants of Feedback and Prediction Error Processing in Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Lucas Kastner

    2017-10-01

    Full Text Available Successful learning hinges on the evaluation of positive and negative feedback. We assessed differential learning from reward and punishment in a monetary reinforcement learning paradigm, together with cardiac concomitants of positive and negative feedback processing. On the behavioral level, learning from reward resulted in more advantageous behavior than learning from punishment, suggesting a differential impact of reward and punishment on successful feedback-based learning. On the autonomic level, learning and feedback processing were closely mirrored by phasic cardiac responses on a trial-by-trial basis: (1) Negative feedback was accompanied by faster and prolonged heart rate deceleration compared to positive feedback. (2) Cardiac responses shifted from feedback presentation at the beginning of learning to stimulus presentation later on. (3) Most importantly, the strength of phasic cardiac responses to the presentation of feedback correlated with the strength of prediction error signals that alert the learner to the necessity for behavioral adaptation. Considering participants' weight status and gender revealed obesity-related deficits in learning to avoid negative consequences and less consistent behavioral adaptation in women compared to men. In sum, our results provide strong new evidence for the notion that during learning phasic cardiac responses reflect an internal value and feedback monitoring system that is sensitive to the violation of performance-based expectations. Moreover, inter-individual differences in weight status and gender may affect both behavioral and autonomic responses in reinforcement-based learning.

  19. Reinforcement Learning in the Game of Othello: Learning Against a Fixed Opponent and Learning from Self-Play

    NARCIS (Netherlands)

    van der Ree, Michiel; Wiering, Marco

    2013-01-01

    This paper compares three strategies in using reinforcement learning algorithms to let an artificial agent learn to play the game of Othello. The three strategies compared are: learning by self-play, learning from playing against a fixed opponent, and learning from playing against a fixed

  20. Influence of temporal context on value in the multiple-chains and successive-encounters procedures.

    Science.gov (United States)

    O'Daly, Matthew; Angulo, Samuel; Gipson, Cassandra; Fantino, Edmund

    2006-05-01

    This set of studies explored the influence of temporal context across multiple-chain and multiple-successive-encounters procedures. Following training with different temporal contexts, the value of stimuli sharing similar reinforcement schedules was assessed by presenting these stimuli in concurrent probes. The results for the multiple-chain schedule indicate that temporal context does impact the value of a conditioned reinforcer consistent with delay-reduction theory, such that a stimulus signaling a greater reduction in delay until reinforcement has greater value. Further, nonreinforced stimuli that are concurrently presented with the preferred terminal link also have greater value, consistent with value transfer. The effects of context on value for conditions with the multiple-successive-encounters procedure, however, appear to depend on whether the search schedule or alternate handling schedule was manipulated, as well as on whether the tested stimuli were the rich or lean schedules in their components. Overall, the results help delineate the conditions under which temporal context affects conditioned-reinforcement value (acting as a learning variable) and the conditions under which it does not (acting as a performance variable), an issue of relevance to theories of choice.

  1. Time representation in reinforcement learning models of the basal ganglia

    Directory of Open Access Journals (Sweden)

    Samuel Joseph Gershman

    2014-01-01

    Full Text Available Reinforcement learning models have been influential in understanding many aspects of basal ganglia function, from reward prediction to action selection. Time plays an important role in these models, but there is still no theoretical consensus about what kind of time representation is used by the basal ganglia. We review several theoretical accounts and their supporting evidence. We then discuss the relationship between reinforcement learning models and the timing mechanisms that have been attributed to the basal ganglia. We hypothesize that a single computational system may underlie both reinforcement learning and interval timing—the perception of duration in the range of seconds to hours. This hypothesis, which extends earlier models by incorporating a time-sensitive action selection mechanism, may have important implications for understanding disorders like Parkinson's disease in which both decision making and timing are impaired.

  2. Safe Exploration of State and Action Spaces in Reinforcement Learning

    OpenAIRE

    Garcia, Javier; Fernandez, Fernando

    2014-01-01

    In this paper, we consider the important problem of safe exploration in reinforcement learning. While reinforcement learning is well-suited to domains with complex transition dynamics and high-dimensional state-action spaces, an additional challenge is posed by the need for safe and efficient exploration. Traditional exploration techniques are not particularly useful for solving dangerous tasks, where the trial and error process may lead to the selection of actions whose execution in some sta...

  3. Adversarial Reinforcement Learning in a Cyber Security Simulation

    OpenAIRE

    Elderman, Richard; Pater, Leon; Thie, Albert; Drugan, Madalina; Wiering, Marco

    2017-01-01

    This paper focuses on cyber-security simulations in networks modeled as a Markov game with incomplete information and stochastic elements. The resulting game is an adversarial sequential decision making problem played with two agents, the attacker and defender. The two agents pit one reinforcement learning technique, like neural networks, Monte Carlo learning and Q-learning, against each other and examine their effectiveness against learning opponents. The results showed that Monte Carlo lear...

  4. Reinforcement learning for dpm of embedded visual sensor nodes

    International Nuclear Information System (INIS)

    Khani, U.; Sadhayo, I. H.

    2014-01-01

    This paper proposes an RL (Reinforcement Learning) based DPM (Dynamic Power Management) technique to learn time out policies during a visual sensor node's operation, which has multiple power/performance states. As opposed to the widely used static time out policies, our proposed DPM policy, which is also referred to as OLTP (Online Learning of Time out Policies), learns to dynamically change the time out decisions in the different node states including the non-operational states. The selection of time out values in different power/performance states of a visual sensing platform is based on the workload estimates derived from an ML-ANN (Multi-Layer Artificial Neural Network) and an objective function given by weighted performance and power parameters. The DPM approach is also able to dynamically adjust the power-performance weights online to satisfy a given constraint of either power consumption or performance. Results show that the proposed learning algorithm explores the power-performance tradeoff with a non-stationary workload and outperforms other DPM policies. It also performs the online adjustment of the tradeoff parameters in order to meet a user-specified constraint. (author)

  5. Tackling Error Propagation through Reinforcement Learning: A Case of Greedy Dependency Parsing

    NARCIS (Netherlands)

    Le, M.N.; Fokkens, A.S.

    Error propagation is a common problem in NLP. Reinforcement learning explores erroneous states during training and can therefore be more robust when mistakes are made early in a process. In this paper, we apply reinforcement learning to greedy dependency parsing which is known to suffer from error

  6. Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making.

    Science.gov (United States)

    Schönberg, Tom; Daw, Nathaniel D; Joel, Daphna; O'Doherty, John P

    2007-11-21

    The computational framework of reinforcement learning has been used to forward our understanding of the neural mechanisms underlying reward learning and decision-making behavior. It is known that humans vary widely in their performance in decision-making tasks. Here, we used a simple four-armed bandit task in which subjects are almost evenly split into two groups on the basis of their performance: those who do learn to favor choice of the optimal action and those who do not. Using models of reinforcement learning we sought to determine the neural basis of these intrinsic differences in performance by scanning both groups with functional magnetic resonance imaging. We scanned 29 subjects while they performed the reward-based decision-making task. Our results suggest that these two groups differ markedly in the degree to which reinforcement learning signals in the striatum are engaged during task performance. While the learners showed robust prediction error signals in both the ventral and dorsal striatum during learning, the nonlearner group showed a marked absence of such signals. Moreover, the magnitude of prediction error signals in a region of dorsal striatum correlated significantly with a measure of behavioral performance across all subjects. These findings support a crucial role of prediction error signals, likely originating from dopaminergic midbrain neurons, in enabling learning of action selection preferences on the basis of obtained rewards. Thus, spontaneously observed individual differences in decision making performance demonstrate the suggested dependence of this type of learning on the functional integrity of the dopaminergic striatal system in humans.

  7. Reinforcement Learning for a New Piano Mover

    Directory of Open Access Journals (Sweden)

    Yuko Ishiwaka

    2005-08-01

    Full Text Available We attempt to achieve cooperative behavior of autonomous decentralized agents constructed via Q-Learning, which is a type of reinforcement learning. As such, in the present paper, we examine the piano mover's problem. We propose a multi-agent architecture that has a training agent, learning agents, and an intermediate agent. Learning agents are heterogeneous and can communicate with each other. The movement of an object with these three kinds of agents depends on the composition of the actions of the learning agents. By learning its own shape through the learning agents, avoidance of obstacles by the object is expected. We simulate the proposed method in a two-dimensional continuous world. Results obtained in the present investigation reveal the effectiveness of the proposed method.

  8. Applications of Deep Learning and Reinforcement Learning to Biological Data.

    Science.gov (United States)

    Mahmud, Mufti; Kaiser, Mohammed Shamim; Hussain, Amir; Vassanelli, Stefano

    2018-06-01

    Rapid advances in hardware-based technologies during the past decades have opened up new possibilities for life scientists to gather multimodal data in various application domains, such as omics, bioimaging, medical imaging, and (brain/body)-machine interfaces. These have generated novel opportunities for development of dedicated data-intensive machine learning techniques. In particular, recent research in deep learning (DL), reinforcement learning (RL), and their combination (deep RL) promise to revolutionize the future of artificial intelligence. The growth in computational power accompanied by faster and increased data storage, and declining computing costs have already allowed scientists in various fields to apply these techniques on data sets that were previously intractable owing to their size and complexity. This paper provides a comprehensive survey on the application of DL, RL, and deep RL techniques in mining biological data. In addition, we compare the performances of DL techniques when applied to different data sets across various application domains. Finally, we outline open issues in this challenging research area and discuss future development perspectives.

  9. Place preference and vocal learning rely on distinct reinforcers in songbirds.

    Science.gov (United States)

    Murdoch, Don; Chen, Ruidong; Goldberg, Jesse H

    2018-04-30

    In reinforcement learning (RL) agents are typically tasked with maximizing a single objective function such as reward. But it remains poorly understood how agents might pursue distinct objectives at once. In machines, multiobjective RL can be achieved by dividing a single agent into multiple sub-agents, each of which is shaped by agent-specific reinforcement, but it remains unknown if animals adopt this strategy. Here we use songbirds to test if navigation and singing, two behaviors with distinct objectives, can be differentially reinforced. We demonstrate that strobe flashes aversively condition place preference but not song syllables. Brief noise bursts aversively condition song syllables but positively reinforce place preference. Thus distinct behavior-generating systems, or agencies, within a single animal can be shaped by correspondingly distinct reinforcement signals. Our findings suggest that spatially segregated vocal circuits can solve a credit assignment problem associated with multiobjective learning.

  10. High and low temperatures have unequal reinforcing properties in Drosophila spatial learning.

    Science.gov (United States)

    Zars, Melissa; Zars, Troy

    2006-07-01

    Small insects regulate their body temperature solely through behavior. Thus, sensing environmental temperature and implementing an appropriate behavioral strategy can be critical for survival. The fly Drosophila melanogaster prefers 24 degrees C, avoiding higher and lower temperatures when tested on a temperature gradient. Furthermore, temperatures above 24 degrees C have negative reinforcing properties. In contrast, we found that flies have a preference in operant learning experiments for a low-temperature-associated position rather than the 24 degrees C alternative in the heat-box. Two additional differences between high- and low-temperature reinforcement, i.e., temperatures above and below 24 degrees C, were found. Temperatures equally above and below 24 degrees C did not reinforce equally and only high temperatures supported increased memory performance with reversal conditioning. Finally, low- and high-temperature reinforced memories are similarly sensitive to two genetic mutations. Together these results indicate the qualitative meaning of temperatures below 24 degrees C depends on the dynamics of the temperatures encountered and that the reinforcing effects of these temperatures depend on at least some common genetic components. Conceptualizing these results using the Wolf-Heisenberg model of operant conditioning, we propose the maximum difference in experienced temperatures determines the magnitude of the reinforcement input to a conditioning circuit.

  11. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain.

    Science.gov (United States)

    Niv, Yael; Edlund, Jeffrey A; Dayan, Peter; O'Doherty, John P

    2012-01-11

    Humans and animals are exquisitely, though idiosyncratically, sensitive to risk or variance in the outcomes of their actions. Economic, psychological, and neural aspects of this are well studied when information about risk is provided explicitly. However, we must normally learn about outcomes from experience, through trial and error. Traditional models of such reinforcement learning focus on learning about the mean reward value of cues and ignore higher order moments such as variance. We used fMRI to test whether the neural correlates of human reinforcement learning are sensitive to experienced risk. Our analysis focused on anatomically delineated regions of a priori interest in the nucleus accumbens, where blood oxygenation level-dependent (BOLD) signals have been suggested as correlating with quantities derived from reinforcement learning. We first provide unbiased evidence that the raw BOLD signal in these regions corresponds closely to a reward prediction error. We then derive from this signal the learned values of cues that predict rewards of equal mean but different variance and show that these values are indeed modulated by experienced risk. Moreover, a close neurometric-psychometric coupling exists between the fluctuations of the experience-based evaluations of risky options that we measured neurally and the fluctuations in behavioral risk aversion. This suggests that risk sensitivity is integral to human learning, illuminating economic models of choice, neuroscientific models of affective learning, and the workings of the underlying neural mechanisms.
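
    One common way to formalize risk-sensitive value learning is with asymmetric learning rates for positive and negative prediction errors, which makes learned values depend on outcome variance. The sketch below is that generic formalization under assumed parameters, not necessarily the exact model fitted in the study.

```python
# Hedged sketch: risk-sensitive value learning via asymmetric learning rates.
# With alpha_minus > alpha_plus, a variable-reward cue ends up valued below its
# mean payoff, i.e. the learner behaves risk-aversely.
import numpy as np

rng = np.random.default_rng(0)

def learn_cue_value(rewards, alpha_plus=0.05, alpha_minus=0.15):
    v = 0.0
    for r in rewards:
        pe = r - v
        v += (alpha_plus if pe > 0 else alpha_minus) * pe
    return v

sure_cue = np.full(1000, 0.5)                        # always pays 0.5
risky_cue = rng.choice([0.0, 1.0], size=1000)        # pays 0 or 1 with the same mean
print("sure cue value:", round(learn_cue_value(sure_cue), 3))
print("risky cue value:", round(learn_cue_value(risky_cue), 3))   # ends up lower
```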

  12. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning.

    Science.gov (United States)

    Pilarski, Patrick M; Dawson, Michael R; Degris, Thomas; Fahimi, Farbod; Carey, Jason P; Sutton, Richard S

    2011-01-01

    As a contribution toward the goal of adaptable, intelligent artificial limbs, this work introduces a continuous actor-critic reinforcement learning method for optimizing the control of multi-function myoelectric devices. Using a simulated upper-arm robotic prosthesis, we demonstrate how it is possible to derive successful limb controllers from myoelectric data using only a sparse human-delivered training signal, without requiring detailed knowledge about the task domain. This reinforcement-based machine learning framework is well suited for use by both patients and clinical staff, and may be easily adapted to different application domains and the needs of individual amputees. To our knowledge, this is the first myoelectric control approach that facilitates the online learning of new amputee-specific motions based only on a one-dimensional (scalar) feedback signal provided by the user of the prosthesis. © 2011 IEEE
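
    A minimal continuous-action actor-critic of the kind described can be sketched as follows: a linear critic learns a value function from the scalar feedback (treated as reward), and its temporal-difference error drives a policy-gradient update of a Gaussian actor. The features, step sizes, and exploration noise are assumptions, not the published controller.

```python
# Hedged sketch: continuous-action actor-critic with a linear critic and a
# Gaussian policy over one actuator command, trained from a scalar reward.
import numpy as np

rng = np.random.default_rng(0)

class ActorCritic:
    def __init__(self, n_features, alpha_v=0.1, alpha_pi=0.01, gamma=0.95, sigma=0.2):
        self.v = np.zeros(n_features)       # critic weights (linear value function)
        self.mu = np.zeros(n_features)      # actor weights (mean of Gaussian policy)
        self.alpha_v, self.alpha_pi = alpha_v, alpha_pi
        self.gamma, self.sigma = gamma, sigma

    def act(self, x):
        # Sample an action around the current policy mean (exploration).
        return rng.normal(self.mu @ x, self.sigma)

    def update(self, x, a, reward, x_next, done):
        td_error = reward + (0.0 if done else self.gamma * self.v @ x_next) - self.v @ x
        self.v += self.alpha_v * td_error * x
        # Policy-gradient step: push the mean toward actions that beat expectation.
        self.mu += self.alpha_pi * td_error * (a - self.mu @ x) / self.sigma**2 * x
        return td_error
```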

  13. Sex Differences in Reinforcement and Punishment on Prime-Time Television.

    Science.gov (United States)

    Downs, A. Chris; Gowan, Darryl C.

    1980-01-01

    Television programs were analyzed for frequencies of positive reinforcement and punishment exchanged among performers varying in age and sex. Females were found to more often exhibit and receive reinforcement, whereas males more often exhibited and received punishment. These findings have implications for children's learning of positive and…

  14. Joint Extraction of Entities and Relations Using Reinforcement Learning and Deep Learning

    Directory of Open Access Journals (Sweden)

    Yuntian Feng

    2017-01-01

    Full Text Available We use both reinforcement learning and deep learning to simultaneously extract entities and relations from unstructured texts. For reinforcement learning, we model the task as a two-step decision process. Deep learning is used to automatically capture the most important information from unstructured texts, which represent the state in the decision process. By designing the reward function per step, our proposed method can pass the information of entity extraction to relation extraction and obtain feedback in order to extract entities and relations simultaneously. Firstly, we use bidirectional LSTM to model the context information, which realizes preliminary entity extraction. On the basis of the extraction results, attention based method can represent the sentences that include target entity pair to generate the initial state in the decision process. Then we use Tree-LSTM to represent relation mentions to generate the transition state in the decision process. Finally, we employ Q-Learning algorithm to get control policy π in the two-step decision process. Experiments on ACE2005 demonstrate that our method attains better performance than the state-of-the-art method and gets a 2.4% increase in recall-score.

  15. Joint Extraction of Entities and Relations Using Reinforcement Learning and Deep Learning.

    Science.gov (United States)

    Feng, Yuntian; Zhang, Hongjun; Hao, Wenning; Chen, Gang

    2017-01-01

    We use both reinforcement learning and deep learning to simultaneously extract entities and relations from unstructured texts. For reinforcement learning, we model the task as a two-step decision process. Deep learning is used to automatically capture the most important information from unstructured texts, which represent the state in the decision process. By designing the reward function per step, our proposed method can pass the information of entity extraction to relation extraction and obtain feedback in order to extract entities and relations simultaneously. Firstly, we use bidirectional LSTM to model the context information, which realizes preliminary entity extraction. On the basis of the extraction results, attention based method can represent the sentences that include target entity pair to generate the initial state in the decision process. Then we use Tree-LSTM to represent relation mentions to generate the transition state in the decision process. Finally, we employ Q-Learning algorithm to get control policy π in the two-step decision process. Experiments on ACE2005 demonstrate that our method attains better performance than the state-of-the-art method and gets a 2.4% increase in recall-score.

  16. An Analysis of Stochastic Game Theory for Multiagent Reinforcement Learning

    National Research Council Canada - National Science Library

    Bowling, Michael

    2000-01-01

    .... In this paper we contribute a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory community and reinforcement learning communities. We examine the assumptions and limitations of these algorithms, and identify similarities between these algorithms, single agent reinforcement learners, and basic game theory techniques.

  17. Reinforcement function design and bias for efficient learning in mobile robots

    International Nuclear Information System (INIS)

    Touzet, C.; Santos, J.M.

    1998-01-01

    The main paradigm in the sub-symbolic learning robot domain is the reinforcement learning method. Various techniques have been developed to deal with the memorization/generalization problem, demonstrating the superior ability of artificial neural network implementations. In this paper, the authors address the issue of designing the reinforcement so as to optimize the exploration part of the learning. They also present and summarize work on the use of bias intended to achieve the effective synthesis of the desired behavior. Demonstrative experiments involving a self-organizing map implementation of Q-learning and real mobile robots (Nomad 200 and Khepera) in an obstacle-avoidance behavior synthesis task are described. 3 figs., 5 tabs

  18. AN EXTENDED REINFORCEMENT LEARNING MODEL OF BASAL GANGLIA TO UNDERSTAND THE CONTRIBUTIONS OF SEROTONIN AND DOPAMINE IN RISK-BASED DECISION MAKING, REWARD PREDICTION, AND PUNISHMENT LEARNING

    Directory of Open Access Journals (Sweden)

    Pragathi Priyadharsini Balasubramani

    2014-04-01

    Full Text Available Although empirical and neural studies show that serotonin (5HT) plays many functional roles in the brain, prior computational models mostly focus on its role in behavioral inhibition. In this study, we present a model of risk-based decision making in a modified Reinforcement Learning (RL) framework. The model depicts the roles of dopamine (DA) and serotonin (5HT) in the Basal Ganglia (BG). In this model, the DA signal is represented by the temporal difference error (δ), while the 5HT signal is represented by a parameter (α) that controls risk prediction error. This formulation, which accommodates both 5HT and DA, reconciles some of the diverse roles of 5HT, particularly in connection with the BG system. We apply the model to different experimental paradigms used to study the role of 5HT: (1) risk-sensitive decision making, where 5HT controls risk assessment, (2) temporal reward prediction, where 5HT controls the time-scale of reward prediction, and (3) reward/punishment sensitivity, in which the punishment prediction error depends on 5HT levels. Thus the proposed integrated RL model reconciles several existing theories of 5HT and DA in the BG.
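
    The abstract's central idea can be sketched with a bandit learner that tracks both expected reward (updated by a TD-like error, the DA-like signal) and reward variance (updated by a risk prediction error), combining them with a weight α standing in for the 5HT-like parameter. The update rules below are an illustrative reading under assumed parameters, not the paper's exact equations.

```python
# Hedged sketch: reward and risk prediction errors combined into a
# risk-discounted action value, with alpha_risk weighting the risk term.
import numpy as np

rng = np.random.default_rng(0)

def risk_sensitive_bandit(p_reward, alpha_risk=0.5, lr=0.1, trials=1000, eps=0.1):
    n = len(p_reward)
    Q = np.zeros(n)          # expected reward per action
    R = np.zeros(n)          # reward-variance estimate per action
    for _ in range(trials):
        utility = Q - alpha_risk * np.sqrt(R)        # risk-discounted value
        a = int(rng.integers(n)) if rng.random() < eps else int(np.argmax(utility))
        r = float(rng.random() < p_reward[a])
        delta = r - Q[a]                             # reward prediction error (DA-like)
        xi = delta**2 - R[a]                         # risk prediction error
        Q[a] += lr * delta
        R[a] += lr * xi
    return Q, R
```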

  19. Towards autonomous neuroprosthetic control using Hebbian reinforcement learning.

    Science.gov (United States)

    Mahmoudi, Babak; Pohlmeyer, Eric A; Prins, Noeline W; Geng, Shijia; Sanchez, Justin C

    2013-12-01

    Our goal was to design an adaptive neuroprosthetic controller that could learn the mapping from neural states to prosthetic actions and automatically adjust adaptation using only a binary evaluative feedback as a measure of desirability/undesirability of performance. Hebbian reinforcement learning (HRL) in a connectionist network was used for the design of the adaptive controller. The method combines the efficiency of supervised learning with the generality of reinforcement learning. The convergence properties of this approach were studied using both closed-loop control simulations and open-loop simulations that used primate neural data from robot-assisted reaching tasks. The HRL controller was able to perform classification and regression tasks using its episodic and sequential learning modes, respectively. In our experiments, the HRL controller quickly achieved convergence to an effective control policy, followed by robust performance. The controller also automatically stopped adapting the parameters after converging to a satisfactory control policy. Additionally, when the input neural vector was reorganized, the controller resumed adaptation to maintain performance. By estimating an evaluative feedback directly from the user, the HRL control algorithm may provide an efficient method for autonomous adaptation of neuroprosthetic systems. This method may enable the user to teach the controller the desired behavior using only a simple feedback signal.
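
    A single Hebbian reinforcement learning step of the general kind described can be sketched as below: the weight change is the product of presynaptic (neural state) and postsynaptic (action unit) activity, gated by a binary evaluative feedback. The network shape, exploration noise, and the feedback callable are assumptions, not the published controller.

```python
# Hedged sketch: one Hebbian reinforcement learning step gated by a binary
# evaluative signal (+1 desirable, -1 undesirable) supplied by the user/task.
import numpy as np

rng = np.random.default_rng(0)

def hebbian_rl_step(W, x, evaluative_feedback, lr=0.05, noise=0.1):
    """W: action-unit weights (n_actions x n_features); x: neural-state features;
    evaluative_feedback(action) -> +1 or -1 (assumed to come from the user/task)."""
    y = np.tanh(W @ x + noise * rng.normal(size=W.shape[0]))  # noisy action-unit activations
    action = int(np.argmax(y))
    f = evaluative_feedback(action)                           # binary reinforcement signal
    W += lr * f * np.outer(y, x)                              # Hebbian update gated by feedback
    return action, W
```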

  20. Neurofeedback in Learning Disabled Children: Visual versus Auditory Reinforcement.

    Science.gov (United States)

    Fernández, Thalía; Bosch-Bayard, Jorge; Harmony, Thalía; Caballero, María I; Díaz-Comas, Lourdes; Galán, Lídice; Ricardo-Garcell, Josefina; Aubert, Eduardo; Otero-Ojeda, Gloria

    2016-03-01

    Children with learning disabilities (LD) frequently have an EEG characterized by an excess of theta and a deficit of alpha activities. NFB using an auditory stimulus as reinforcer has proven to be a useful tool to treat LD children by positively reinforcing decreases of the theta/alpha ratio. The aim of the present study was to optimize the NFB procedure by comparing the efficacy of visual (with eyes open) versus auditory (with eyes closed) reinforcers. Twenty LD children with an abnormally high theta/alpha ratio were randomly assigned to the Auditory or the Visual group, where a 500 Hz tone or a visual stimulus (a white square), respectively, was used as a positive reinforcer when the value of the theta/alpha ratio was reduced. Both groups had signs consistent with EEG maturation, but only the Auditory Group showed behavioral/cognitive improvements. In conclusion, the auditory reinforcer was more efficacious in reducing the theta/alpha ratio, and it improved the cognitive abilities more than the visual reinforcer.

  1. Intranasal oxytocin enhances socially-reinforced learning in rhesus monkeys

    Directory of Open Access Journals (Sweden)

    Lisa A Parr

    2014-09-01

    Full Text Available There are currently no drugs approved for the treatment of social deficits associated with autism spectrum disorders (ASD. One hypothesis for these deficits is that individuals with ASD lack the motivation to attend to social cues because those cues are not implicitly rewarding. Therefore, any drug that could enhance the rewarding quality of social stimuli could have a profound impact on the treatment of ASD, and other social disorders. Oxytocin (OT is a neuropeptide that has been effective in enhancing social cognition and social reward in humans. The present study examined the ability of OT to selectively enhance learning after social compared to nonsocial reward in rhesus monkeys, an important species for modeling the neurobiology of social behavior in humans. Monkeys were required to learn an implicit visual matching task after receiving either intranasal (IN OT or Placebo (saline. Correct trials were rewarded with the presentation of positive and negative social (play faces/threat faces or nonsocial (banana/cage locks stimuli, plus food. Incorrect trials were not rewarded. Results demonstrated a strong effect of socially-reinforced learning, monkeys’ performed significantly better when reinforced with social versus nonsocial stimuli. Additionally, socially-reinforced learning was significantly better and occurred faster after IN-OT compared to placebo treatment. Performance in the IN-OT, but not Placebo, condition was also significantly better when the reinforcement stimuli were emotionally positive compared to negative facial expressions. These data support the hypothesis that OT may function to enhance prosocial behavior in primates by increasing the rewarding quality of emotionally positive, social compared to emotionally negative or nonsocial images. These data also support the use of the rhesus monkey as a model for exploring the neurobiological basis of social behavior and its impairment.

  2. applying reinforcement learning to the weapon assignment problem

    African Journals Online (AJOL)

    ismith

    Carlo (MC) control algorithm with exploring starts (MCES), and an off-policy ..... closest to the threat should fire (that weapon also had the highest probability to ... Monte Carlo ..... “Reinforcement learning: Theory, methods and application to.

  3. Reinforcement and Systemic Machine Learning for Decision Making

    CERN Document Server

    Kulkarni, Parag

    2012-01-01

    Reinforcement and Systemic Machine Learning for Decision Making There are always difficulties in making machines that learn from experience. Complete information is not always available, or it becomes available in bits and pieces over a period of time. With respect to systemic learning, there is a need to understand the impact of decisions and actions on a system over that period of time. This book takes a holistic approach to addressing that need and presents a new paradigm, creating new learning applications and, ultimately, more intelligent machines. The first book of its kind in this new an

  4. Reinforcement active learning in the vibrissae system: optimal object localization.

    Science.gov (United States)

    Gordon, Goren; Dorfman, Nimrod; Ahissar, Ehud

    2013-01-01

    Rats move their whiskers to acquire information about their environment. It has been observed that they palpate novel objects and objects they are required to localize in space. We analyze whisker-based object localization using two complementary paradigms, namely, active learning and intrinsic-reward reinforcement learning. Active learning algorithms select the next training samples according to the hypothesized solution in order to better discriminate between correct and incorrect labels. Intrinsic-reward reinforcement learning uses prediction errors as the reward to an actor-critic design, such that behavior converges to the one that optimizes the learning process. We show that in the context of object localization, the two paradigms result in palpation whisking as their respective optimal solution. These results suggest that rats may employ principles of active learning and/or intrinsic reward in tactile exploration and can guide future research to seek the underlying neuronal mechanisms that implement them. Furthermore, these paradigms are easily transferable to biomimetic whisker-based artificial sensors and can improve the active exploration of their environment. Copyright © 2012 Elsevier Ltd. All rights reserved.
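
    The second paradigm in this record feeds prediction errors back as the reward of an actor-critic learner. The sketch below is a schematic, discrete-state version of that idea, assuming a simple scalar forward model of the sensory input; all sizes, names, and rates are illustrative rather than the whisking model studied in the paper.

        import numpy as np

        n_states, n_actions = 8, 3
        rng = np.random.default_rng(1)
        V = np.zeros(n_states)                       # critic
        prefs = np.zeros((n_states, n_actions))      # actor preferences
        sensor_model = np.zeros(n_states)            # forward model of the sensory input
        eta_v, eta_pi, eta_m, gamma = 0.1, 0.05, 0.2, 0.9

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def select_action(s):
            p = softmax(prefs[s])
            return int(rng.choice(n_actions, p=p)), p

        def intrinsic_update(s, a, s_next, observation, p):
            # Intrinsic reward: squared prediction error of the forward model,
            # so actions that yield informative observations are reinforced.
            pred_error = observation - sensor_model[s]
            sensor_model[s] += eta_m * pred_error
            r_intrinsic = pred_error ** 2
            # Actor-critic update driven by that intrinsic reward (simplified actor step).
            delta = r_intrinsic + gamma * V[s_next] - V[s]
            V[s] += eta_v * delta
            prefs[s, a] += eta_pi * delta * (1.0 - p[a])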

  5. Bi-directional effect of increasing doses of baclofen on reinforcement learning

    Directory of Open Access Journals (Sweden)

    Jean eTerrier

    2011-07-01

    Full Text Available In rodents as well as in humans, efficient reinforcement learning depends on dopamine (DA) released from ventral tegmental area (VTA) neurons. It has been shown that in brain slices of mice, GABAB-receptor agonists at low concentrations increase the firing frequency of VTA-DA neurons, while high concentrations reduce the firing frequency. It remains elusive, however, whether baclofen can modulate reinforcement learning. Here, in a double-blind study in 34 healthy human volunteers, we tested the effects of a low and a high concentration of oral baclofen in a gambling task associated with monetary reward. A low (20 mg) dose of baclofen increased the efficiency of reward-associated learning but had no effect on the avoidance of monetary loss. A high (50 mg) dose of baclofen, on the other hand, did not affect the learning curve. At the end of the task, subjects who received 20 mg baclofen p.o. were more accurate in choosing the symbol linked to the highest probability of earning money compared to the control group (89.55±1.39% vs 81.07±1.55%, p=0.002). Our results support a model where baclofen, at low concentrations, causes a disinhibition of DA neurons, increases DA levels, and thus facilitates reinforcement learning.

  6. Traffic light control by multiagent reinforcement learning systems

    NARCIS (Netherlands)

    Bakker, B.; Whiteson, S.; Kester, L.; Groen, F.C.A.; Babuška, R.; Groen, F.C.A.

    2010-01-01

    Traffic light control is one of the main means of controlling road traffic. Improving traffic control is important because it can lead to higher traffic throughput and reduced traffic congestion. This chapter describes multiagent reinforcement learning techniques for automatic optimization of

  7. Traffic Light Control by Multiagent Reinforcement Learning Systems

    NARCIS (Netherlands)

    Bakker, B.; Whiteson, S.; Kester, L.J.H.M.; Groen, F.C.A.

    2010-01-01

    Traffic light control is one of the main means of controlling road traffic. Improving traffic control is important because it can lead to higher traffic throughput and reduced traffic congestion. This chapter describes multiagent reinforcement learning techniques for automatic optimization of

  8. Dissociable neural representations of reinforcement and belief prediction errors underlie strategic learning.

    Science.gov (United States)

    Zhu, Lusha; Mathewson, Kyle E; Hsu, Ming

    2012-01-31

    Decision-making in the presence of other competitive intelligent agents is fundamental for social and economic behavior. Such decisions require agents to behave strategically, where in addition to learning about the rewards and punishments available in the environment, they also need to anticipate and respond to actions of others competing for the same rewards. However, whereas we know much about strategic learning at both theoretical and behavioral levels, we know relatively little about the underlying neural mechanisms. Here, we show using a multi-strategy competitive learning paradigm that strategic choices can be characterized by extending the reinforcement learning (RL) framework to incorporate agents' beliefs about the actions of their opponents. Furthermore, using this characterization to generate putative internal values, we used model-based functional magnetic resonance imaging to investigate neural computations underlying strategic learning. We found that the distinct notions of prediction errors derived from our computational model are processed in a partially overlapping but distinct set of brain regions. Specifically, we found that the RL prediction error was correlated with activity in the ventral striatum. In contrast, activity in the ventral striatum, as well as the rostral anterior cingulate (rACC), was correlated with a previously uncharacterized belief-based prediction error. Furthermore, activity in rACC reflected individual differences in degree of engagement in belief learning. These results suggest a model of strategic behavior where learning arises from interaction of dissociable reinforcement and belief-based inputs.
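
    The record extends RL with beliefs about opponents' actions, each generating its own prediction error. Below is a minimal sketch of how the two error signals could be computed side by side in a two-action matrix game; the payoff matrix, learning rates, and variable names are assumptions for illustration, not the paper's computational model.

        import numpy as np

        n_actions = 2
        Q = np.zeros(n_actions)            # reinforcement values of own actions
        belief = np.full(n_actions, 0.5)   # belief over the opponent's actions
        payoff = np.array([[1.0, 0.0],     # illustrative payoff[my_action, opp_action]
                           [0.0, 1.0]])
        eta_q, eta_b = 0.2, 0.2

        def strategic_update(my_a, opp_a, reward):
            # Reinforcement prediction error: outcome vs. value of the chosen action.
            rl_pe = reward - Q[my_a]
            Q[my_a] += eta_q * rl_pe
            # Belief prediction error: observed opponent action vs. current belief.
            belief_pe = np.eye(n_actions)[opp_a] - belief
            belief[:] += eta_b * belief_pe
            # Belief-based expected payoffs that would guide the next strategic choice.
            return rl_pe, belief_pe, payoff @ belief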

  9. A Robust Cooperated Control Method with Reinforcement Learning and Adaptive H∞ Control

    Science.gov (United States)

    Obayashi, Masanao; Uchiyama, Shogo; Kuremoto, Takashi; Kobayashi, Kunikazu

    This study proposes a robust cooperated control method that combines reinforcement learning with robust control. A remarkable characteristic of reinforcement learning is that it does not require a model of the system; however, it does not guarantee stability. Robust control, on the other hand, guarantees stability and robustness but requires a system model. We employ both the actor-critic method, a kind of reinforcement learning that controls continuous-valued actions with a minimal amount of computation, and traditional robust control, namely H∞ control. The proposed method was compared with the conventional control method (the actor-critic alone) through computer simulation of controlling the angle and position of a crane system, and the simulation results showed the effectiveness of the proposed method.
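
    The RL half of the record is an actor-critic with continuous-valued actions. A minimal linear Gaussian-policy sketch of that part is given below (the H∞ component is not sketched); the scalar state feature, fixed exploration noise, and learning rates are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(2)
        theta_mu = 0.0    # actor parameter: mean of a Gaussian policy over a 1-D action
        v_weight = 0.0    # critic parameter: linear value function over a scalar feature
        sigma = 0.2       # fixed exploration noise for this sketch
        eta_actor, eta_critic, gamma = 0.01, 0.1, 0.95

        def select_action(x):
            # Continuous action sampled from a state-dependent Gaussian policy.
            return rng.normal(theta_mu * x, sigma)

        def actor_critic_update(x, a, r, x_next):
            global theta_mu, v_weight
            # Critic: TD error on the linear value function.
            delta = r + gamma * v_weight * x_next - v_weight * x
            v_weight += eta_critic * delta * x
            # Actor: Gaussian policy-gradient step scaled by the TD error.
            mu = theta_mu * x
            theta_mu += eta_actor * delta * (a - mu) / sigma ** 2 * x
            return delta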

  10. Energy Management Strategy for a Hybrid Electric Vehicle Based on Deep Reinforcement Learning

    OpenAIRE

    Yue Hu; Weimin Li; Kun Xu; Taimoor Zahid; Feiyan Qin; Chenming Li

    2018-01-01

    An energy management strategy (EMS) is important for hybrid electric vehicles (HEVs) since it plays a decisive role on the performance of the vehicle. However, the variation of future driving conditions deeply influences the effectiveness of the EMS. Most existing EMS methods simply follow predefined rules that are not adaptive to different driving conditions online. Therefore, it is useful that the EMS can learn from the environment or driving cycle. In this paper, a deep reinforcement learn...

  11. Perceptual learning rules based on reinforcers and attention

    NARCIS (Netherlands)

    Roelfsema, Pieter R.; van Ooyen, Arjen; Watanabe, Takeo

    2010-01-01

    How does the brain learn those visual features that are relevant for behavior? In this article, we focus on two factors that guide plasticity of visual representations. First, reinforcers cause the global release of diffusive neuromodulatory signals that gate plasticity. Second, attentional feedback

  12. Experiments with Online Reinforcement Learning in Real-Time Strategy Games

    DEFF Research Database (Denmark)

    Toftgaard Andersen, Kresten; Zeng, Yifeng; Dahl Christensen, Dennis

    2009-01-01

    Real-time strategy (RTS) games provide a challenging platform to implement online reinforcement learning (RL) techniques in a real application. Computer, as one game player, monitors opponents' (human or other computers) strategies and then updates its own policy using RL methods. In this article......, we first examine the suitability of applying the online RL in various computer games. Reinforcement learning application depends on both RL complexity and the game features. We then propose a multi-layer framework for implementing online RL in an RTS game. The framework significantly reduces RL...... the effectiveness of our proposed framework and shed light on relevant issues in using online RL in RTS games....

  13. No impact of repeated extinction exposures on operant responding maintained by different reinforcer rates.

    Science.gov (United States)

    Bai, John Y H; Podlesnik, Christopher A

    2017-05-01

    Greater rates of intermittent reinforcement in the presence of discriminative stimuli generally produce greater resistance to extinction, consistent with predictions of behavioral momentum theory. Other studies reveal more rapid extinction with higher rates of reinforcers - the partial reinforcement extinction effect. Further, repeated extinction often produces more rapid decreases in operant responding due to learning a discrimination between training and extinction contingencies. The present study examined extinction repeatedly with training with different rates of intermittent reinforcement in a multiple schedule. We assessed whether repeated extinction would reverse the pattern of greater resistance to extinction with greater reinforcer rates. Counter to this prediction, resistance to extinction was consistently greater across twelve assessments of training followed by six successive sessions of extinction. Moreover, patterns of responding during extinction resembled those observed during satiation tests, which should not alter discrimination processes with repeated testing. These findings join others suggesting operant responding in extinction can be durable across repeated tests. Copyright © 2017 Elsevier B.V. All rights reserved.

  14. Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments

    OpenAIRE

    Kidziński, Łukasz; Mohanty, Sharada Prasanna; Ong, Carmichael; Huang, Zhewei; Zhou, Shuchang; Pechenko, Anton; Stelmaszczyk, Adam; Jarosik, Piotr; Pavlov, Mikhail; Kolesnikov, Sergey; Plis, Sergey; Chen, Zhibo; Zhang, Zhizheng; Chen, Jiale; Shi, Jun

    2018-01-01

    In the NIPS 2017 Learning to Run challenge, participants were tasked with building a controller for a musculoskeletal model to make it run as fast as possible through an obstacle course. Top participants were invited to describe their algorithms. In this work, we present eight solutions that used deep reinforcement learning approaches, based on algorithms such as Deep Deterministic Policy Gradient, Proximal Policy Optimization, and Trust Region Policy Optimization. Many solutions use similar ...

  15. TEACHING SELF-CONTROL WITH QUALITATIVELY DIFFERENT REINFORCERS

    OpenAIRE

    Passage, Michael; Tincani, Matt; Hantula, Donald A.

    2012-01-01

    This study examined the effectiveness of using qualitatively different reinforcers to teach self-control to an adolescent boy who had been diagnosed with an intellectual disability. First, he was instructed to engage in an activity without programmed reinforcement. Next, he was instructed to engage in the activity under a two-choice fixed-duration schedule of reinforcement. Finally, he was exposed to self-control training, during which the delay to a more preferred reinforcer was initially sh...

  16. Learning to reach by reinforcement learning using a receptive field based function approximation approach with continuous actions.

    Science.gov (United States)

    Tamosiunaite, Minija; Asfour, Tamim; Wörgötter, Florentin

    2009-03-01

    Reinforcement learning methods can be used in robotics applications especially for specific target-oriented problems, for example the reward-based recalibration of goal directed actions. To this end still relatively large and continuous state-action spaces need to be efficiently handled. The goal of this paper is, thus, to develop a novel, rather simple method which uses reinforcement learning with function approximation in conjunction with different reward-strategies for solving such problems. For the testing of our method, we use a four degree-of-freedom reaching problem in 3D-space simulated by a two-joint robot arm system with two DOF each. Function approximation is based on 4D, overlapping kernels (receptive fields) and the state-action space contains about 10,000 of these. Different types of reward structures are being compared, for example, reward-on-touching-only against reward-on-approach. Furthermore, forbidden joint configurations are punished. A continuous action space is used. In spite of a rather large number of states and the continuous action space these reward/punishment strategies allow the system to find a good solution usually within about 20 trials. The efficiency of our method demonstrated in this test scenario suggests that it might be possible to use it on a real robot for problems where mixed rewards can be defined in situations where other types of learning might be difficult.
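
    The function approximation in this record uses overlapping 4-D kernels (receptive fields) over the state-action space. As a much reduced illustration of the same idea, the sketch below approximates a value function over a 1-D state with overlapping Gaussian receptive fields and updates the kernel weights with TD(0); the dimensionality, kernel widths, and rates are assumptions, and the paper's continuous-action machinery is not reproduced.

        import numpy as np

        centers = np.linspace(0.0, 1.0, 11)   # receptive-field centres over a 1-D state
        width = 0.1
        w = np.zeros_like(centers)            # one weight per receptive field
        eta, gamma = 0.1, 0.95

        def features(s):
            # Overlapping Gaussian receptive fields evaluated at state s.
            return np.exp(-0.5 * ((s - centers) / width) ** 2)

        def value(s):
            return float(w @ features(s))

        def td_update(s, r, s_next):
            # TD(0) step on the receptive-field weights.
            global w
            delta = r + gamma * value(s_next) - value(s)
            w += eta * delta * features(s)
            return delta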

  17. Infant Contingency Learning in Different Cultural Contexts

    Science.gov (United States)

    Graf, Frauke; Lamm, Bettina; Goertz, Claudia; Kolling, Thorsten; Freitag, Claudia; Spangler, Sibylle; Fassbender, Ina; Teubert, Manuel; Vierhaus, Marc; Keller, Heidi; Lohaus, Arnold; Schwarzer, Gudrun; Knopf, Monika

    2012-01-01

    Three-month-old Cameroonian Nso farmer and German middle-class infants were compared regarding learning and retention in a computerized mobile task. Infants achieving a preset learning criterion during reinforcement were tested for immediate and long-term retention measured in terms of an increased response rate after reinforcement and after a…

  18. Functional Contour-following via Haptic Perception and Reinforcement Learning.

    Science.gov (United States)

    Hellman, Randall B; Tekin, Cem; van der Schaar, Mihaela; Santos, Veronica J

    2018-01-01

    Many tasks involve the fine manipulation of objects despite limited visual feedback. In such scenarios, tactile and proprioceptive feedback can be leveraged for task completion. We present an approach for real-time haptic perception and decision-making for a haptics-driven, functional contour-following task: the closure of a ziplock bag. This task is challenging for robots because the bag is deformable, transparent, and visually occluded by artificial fingertip sensors that are also compliant. A deep neural net classifier was trained to estimate the state of a zipper within a robot's pinch grasp. A Contextual Multi-Armed Bandit (C-MAB) reinforcement learning algorithm was implemented to maximize cumulative rewards by balancing exploration versus exploitation of the state-action space. The C-MAB learner outperformed a benchmark Q-learner by more efficiently exploring the state-action space while learning a hard-to-code task. The learned C-MAB policy was tested with novel ziplock bag scenarios and contours (wire, rope). Importantly, this work contributes to the development of reinforcement learning approaches that account for limited resources such as hardware life and researcher time. As robots are used to perform complex, physically interactive tasks in unstructured or unmodeled environments, it becomes important to develop methods that enable efficient and effective learning with physical testbeds.
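
    The record contrasts a Contextual Multi-Armed Bandit learner with a Q-learner for the contour-following task. The sketch below shows a generic per-context UCB1 bandit over discretized haptic contexts; it is meant only to convey the exploration/exploitation bookkeeping, and the context/action counts are illustrative, not the paper's C-MAB algorithm.

        import numpy as np

        n_contexts, n_actions = 5, 3              # illustrative: discretized zipper states, actions
        reward_sum = np.zeros((n_contexts, n_actions))
        pull_count = np.zeros((n_contexts, n_actions))

        def choose(context, t):
            # Per-context UCB1: try every arm once, then balance the empirical
            # mean reward against an exploration bonus that shrinks with pulls.
            counts = pull_count[context]
            if (counts == 0).any():
                return int(np.argmin(counts))
            means = reward_sum[context] / counts
            bonus = np.sqrt(2.0 * np.log(t + 1) / counts)
            return int(np.argmax(means + bonus))

        def update(context, action, reward):
            reward_sum[context, action] += reward
            pull_count[context, action] += 1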

  19. Challenges in the Verification of Reinforcement Learning Algorithms

    Science.gov (United States)

    Van Wesel, Perry; Goodloe, Alwyn E.

    2017-01-01

    Machine learning (ML) is increasingly being applied to a wide array of domains from search engines to autonomous vehicles. These algorithms, however, are notoriously complex and hard to verify. This work looks at the assumptions underlying machine learning algorithms as well as some of the challenges in trying to verify ML algorithms. Furthermore, we focus on the specific challenges of verifying reinforcement learning algorithms. These are highlighted using a specific example. Ultimately, we do not offer a solution to the complex problem of ML verification, but point out possible approaches for verification and interesting research opportunities.

  20. What Can Reinforcement Learning Teach Us About Non-Equilibrium Quantum Dynamics

    Science.gov (United States)

    Bukov, Marin; Day, Alexandre; Sels, Dries; Weinberg, Phillip; Polkovnikov, Anatoli; Mehta, Pankaj

    Equilibrium thermodynamics and statistical physics are the building blocks of modern science and technology. Yet, our understanding of thermodynamic processes away from equilibrium is largely missing. In this talk, I will reveal the potential of what artificial intelligence can teach us about the complex behaviour of non-equilibrium systems. Specifically, I will discuss the problem of finding optimal drive protocols to prepare a desired target state in quantum mechanical systems by applying ideas from Reinforcement Learning [one can think of Reinforcement Learning as the study of how an agent (e.g. a robot) can learn and perfect a given policy through interactions with an environment.]. The driving protocols learnt by our agent suggest that the non-equilibrium world features possibilities easily defying intuition based on equilibrium physics.

  1. Spike-based decision learning of Nash equilibria in two-player games.

    Directory of Open Access Journals (Sweden)

    Johannes Friedrich

    Full Text Available Humans and animals face decision tasks in an uncertain multi-agent environment where an agent's strategy may change in time due to the co-adaptation of others' strategies. The neuronal substrate and the computational algorithms underlying such adaptive decision making, however, are largely unknown. We propose a population coding model of spiking neurons with a policy gradient procedure that successfully acquires optimal strategies for classical game-theoretical tasks. The suggested population reinforcement learning reproduces data from human behavioral experiments for the blackjack and the inspector game. It performs optimally according to a pure (deterministic) and mixed (stochastic) Nash equilibrium, respectively. In contrast, temporal-difference (TD) learning, covariance learning, and basic reinforcement learning fail to perform optimally for the stochastic strategy. Spike-based population reinforcement learning, shown to follow the stochastic reward gradient, is therefore a viable candidate to explain automated decision learning of a Nash equilibrium in two-player games.

  2. Social Learning, Reinforcement and Crime: Evidence from Three European Cities

    Science.gov (United States)

    Tittle, Charles R.; Antonaccio, Olena; Botchkovar, Ekaterina

    2012-01-01

    This study reports a cross-cultural test of Social Learning Theory using direct measures of social learning constructs and focusing on the causal structure implied by the theory. Overall, the results strongly confirm the main thrust of the theory. Prior criminal reinforcement and current crime-favorable definitions are highly related in all three…

  3. Modeling Avoidance in Mood and Anxiety Disorders Using Reinforcement Learning.

    Science.gov (United States)

    Mkrtchian, Anahit; Aylward, Jessica; Dayan, Peter; Roiser, Jonathan P; Robinson, Oliver J

    2017-10-01

    Serious and debilitating symptoms of anxiety are the most common mental health problem worldwide, accounting for around 5% of all adult years lived with disability in the developed world. Avoidance behavior-avoiding social situations for fear of embarrassment, for instance-is a core feature of such anxiety. However, as for many other psychiatric symptoms the biological mechanisms underlying avoidance remain unclear. Reinforcement learning models provide formal and testable characterizations of the mechanisms of decision making; here, we examine avoidance in these terms. A total of 101 healthy participants and individuals with mood and anxiety disorders completed an approach-avoidance go/no-go task under stress induced by threat of unpredictable shock. We show an increased reliance in the mood and anxiety group on a parameter of our reinforcement learning model that characterizes a prepotent (pavlovian) bias to withhold responding in the face of negative outcomes. This was particularly the case when the mood and anxiety group was under stress. This formal description of avoidance within the reinforcement learning framework provides a new means of linking clinical symptoms with biophysically plausible models of neural circuitry and, as such, takes us closer to a mechanistic understanding of mood and anxiety disorders. Copyright © 2017 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  4. Reinforcement Learning for Online Control of Evolutionary Algorithms

    NARCIS (Netherlands)

    Eiben, A.; Horvath, Mark; Kowalczyk, Wojtek; Schut, Martijn

    2007-01-01

    The research reported in this paper is concerned with assessing the usefulness of reinforcement learning (RL) for on-line calibration of parameters in evolutionary algorithms (EA). We run an RL procedure and the EA simultaneously, and the RL changes the EA parameters on the fly. We

  5. Video Demo: Deep Reinforcement Learning for Coordination in Traffic Light Control

    NARCIS (Netherlands)

    van der Pol, E.; Oliehoek, F.A.; Bosse, T.; Bredeweg, B.

    2016-01-01

    This video demonstration contrasts two approaches to coordination in traffic light control using reinforcement learning: earlier work, based on a deconstruction of the state space into a linear combination of vehicle states, and our own approach based on the Deep Q-learning algorithm.

  6. Visual paired-associate learning: in search of material-specific effects in adult patients who have undergone temporal lobectomy.

    Science.gov (United States)

    Smith, Mary Lou; Bigel, Marla; Miller, Laurie A

    2011-02-01

    The mesial temporal lobes are important for learning arbitrary associations. It has previously been demonstrated that left mesial temporal structures are involved in learning word pairs, but it is not yet known whether comparable lesions in the right temporal lobe impair visually mediated associative learning. Patients who had undergone left (n=16) or right (n=18) temporal lobectomy for relief of intractable epilepsy and healthy controls (n=13) were administered two paired-associate learning tasks assessing their learning and memory of pairs of abstract designs or pairs of symbols in unique locations. Both patient groups had deficits in learning the designs, but only the right temporal group was impaired in recognition. For the symbol location task, differences were not found in learning, but again a recognition deficit was found for the right temporal group. The findings implicate the mesial temporal structures in relational learning. They support a material-specific effect for recognition but not for learning and recall of arbitrary visual and visual-spatial associative information. Copyright © 2010 Elsevier Inc. All rights reserved.

  7. Simulation-based optimization parametric optimization techniques and reinforcement learning

    CERN Document Server

    Gosavi, Abhijit

    2003-01-01

    Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning introduces the evolving area of simulation-based optimization. The book's objective is two-fold: (1) It examines the mathematical governing principles of simulation-based optimization, thereby providing the reader with the ability to model relevant real-life problems using these techniques. (2) It outlines the computational technology underlying these methods. Taken together these two aspects demonstrate that the mathematical and computational methods discussed in this book do work. Broadly speaking, the book has two parts: (1) parametric (static) optimization and (2) control (dynamic) optimization. Some of the book's special features are: *An accessible introduction to reinforcement learning and parametric-optimization techniques. *A step-by-step description of several algorithms of simulation-based optimization. *A clear and simple introduction to the methodology of neural networks. *A gentle introduction to converg...

  8. Perception-based Co-evolutionary Reinforcement Learning for UAV Sensor Allocation

    National Research Council Canada - National Science Library

    Berenji, Hamid

    2003-01-01

    .... A Perception-based reasoning approach based on co-evolutionary reinforcement learning was developed for jointly addressing sensor allocation on each individual UAV and allocation of a team of UAVs...

  9. A Reinforcement-Based Learning Paradigm Increases Anatomical Learning and Retention-A Neuroeducation Study.

    Science.gov (United States)

    Anderson, Sarah J; Hecker, Kent G; Krigolson, Olave E; Jamniczky, Heather A

    2018-01-01

    In anatomy education, a key hurdle to engaging in higher-level discussion in the classroom is recognizing and understanding the extensive terminology used to identify and describe anatomical structures. Given the time-limited classroom environment, seeking methods to impart this foundational knowledge to students in an efficient manner is essential. Just-in-Time Teaching (JiTT) methods incorporate pre-class exercises (typically online) meant to establish foundational knowledge in novice learners so subsequent instructor-led sessions can focus on deeper, more complex concepts. Determining how best to design and assess pre-class exercises requires a detailed examination of learning and retention in an applied educational context. Here we used electroencephalography (EEG) as a quantitative dependent variable to track learning and examine the efficacy of JiTT activities to teach anatomy. Specifically, we examined changes in the amplitude of the N250 and reward positivity event-related brain potential (ERP) components alongside behavioral performance as novice students participated in a series of computerized reinforcement-based learning modules to teach neuroanatomical structures. We found that as students learned to identify anatomical structures, the amplitude of the N250 increased and reward positivity amplitude decreased in response to positive feedback. On both a retention and a transfer exercise, when learners successfully remembered and translated their knowledge to novel images, the amplitude of the reward positivity remained decreased compared to early learning. Our findings suggest ERPs can be used as a tool to track learning, retention, and transfer of knowledge and that employing the reinforcement learning paradigm is an effective educational approach for developing anatomical expertise.

  10. Multisensory perceptual learning of temporal order: audiovisual learning transfers to vision but not audition.

    Directory of Open Access Journals (Sweden)

    David Alais

    2010-06-01

    Full Text Available An outstanding question in sensory neuroscience is whether the perceived timing of events is mediated by a central supra-modal timing mechanism, or multiple modality-specific systems. We use a perceptual learning paradigm to address this question. Three groups were trained daily for 10 sessions on an auditory, a visual or a combined audiovisual temporal order judgment (TOJ). Groups were pre-tested on a range of TOJ tasks within and between their group modality prior to learning so that transfer of any learning from the trained task could be measured by post-testing other tasks. Robust TOJ learning (reduced temporal order discrimination thresholds) occurred for all groups, although auditory learning (dichotic 500/2000 Hz tones) was slightly weaker than visual learning (lateralised grating patches). Crossmodal TOJs also displayed robust learning. Post-testing revealed that improvements in temporal resolution acquired during visual learning transferred within modality to other retinotopic locations and orientations, but not to auditory or crossmodal tasks. Auditory learning did not transfer to visual or crossmodal tasks, and neither did it transfer within audition to another frequency pair. In an interesting asymmetry, crossmodal learning transferred to all visual tasks but not to auditory tasks. Finally, in all conditions, learning to make TOJs for stimulus onsets did not transfer at all to discriminating temporal offsets. These data present a complex picture of timing processes. The lack of transfer between unimodal groups indicates no central supramodal timing process for this task; however, the audiovisual-to-visual transfer cannot be explained without some form of sensory interaction. We propose that auditory learning occurred in frequency-tuned processes in the periphery, precluding interactions with more central visual and audiovisual timing processes. Functionally the patterns of featural transfer suggest that perceptual learning of temporal order

  11. Multisensory perceptual learning of temporal order: audiovisual learning transfers to vision but not audition.

    Science.gov (United States)

    Alais, David; Cass, John

    2010-06-23

    An outstanding question in sensory neuroscience is whether the perceived timing of events is mediated by a central supra-modal timing mechanism, or multiple modality-specific systems. We use a perceptual learning paradigm to address this question. Three groups were trained daily for 10 sessions on an auditory, a visual or a combined audiovisual temporal order judgment (TOJ). Groups were pre-tested on a range of TOJ tasks within and between their group modality prior to learning so that transfer of any learning from the trained task could be measured by post-testing other tasks. Robust TOJ learning (reduced temporal order discrimination thresholds) occurred for all groups, although auditory learning (dichotic 500/2000 Hz tones) was slightly weaker than visual learning (lateralised grating patches). Crossmodal TOJs also displayed robust learning. Post-testing revealed that improvements in temporal resolution acquired during visual learning transferred within modality to other retinotopic locations and orientations, but not to auditory or crossmodal tasks. Auditory learning did not transfer to visual or crossmodal tasks, and neither did it transfer within audition to another frequency pair. In an interesting asymmetry, crossmodal learning transferred to all visual tasks but not to auditory tasks. Finally, in all conditions, learning to make TOJs for stimulus onsets did not transfer at all to discriminating temporal offsets. These data present a complex picture of timing processes. The lack of transfer between unimodal groups indicates no central supramodal timing process for this task; however, the audiovisual-to-visual transfer cannot be explained without some form of sensory interaction. We propose that auditory learning occurred in frequency-tuned processes in the periphery, precluding interactions with more central visual and audiovisual timing processes. Functionally the patterns of featural transfer suggest that perceptual learning of temporal order may be

  12. Identification of animal behavioral strategies by inverse reinforcement learning.

    Directory of Open Access Journals (Sweden)

    Shoichiro Yamaguchi

    2018-05-01

    Full Text Available Animals are able to reach a desired state in an environment by controlling various behavioral patterns. Identification of the behavioral strategy used for this control is important for understanding animals' decision-making and is fundamental to dissect information processing done by the nervous system. However, methods for quantifying such behavioral strategies have not been fully established. In this study, we developed an inverse reinforcement-learning (IRL) framework to identify an animal's behavioral strategy from behavioral time-series data. We applied this framework to C. elegans thermotactic behavior; after cultivation at a constant temperature with or without food, fed worms prefer, while starved worms avoid the cultivation temperature on a thermal gradient. Our IRL approach revealed that the fed worms used both the absolute temperature and its temporal derivative and that their behavior involved two strategies: directed migration (DM) and isothermal migration (IM). With DM, worms efficiently reached specific temperatures, which explains their thermotactic behavior when fed. With IM, worms moved along a constant temperature, which reflects isothermal tracking, well-observed in previous studies. In contrast to fed animals, starved worms escaped the cultivation temperature using only the absolute, but not the temporal derivative of temperature. We also investigated the neural basis underlying these strategies, by applying our method to thermosensory neuron-deficient worms. Thus, our IRL-based approach is useful in identifying animal strategies from behavioral time-series data and could be applied to a wide range of behavioral studies, including decision-making, in other organisms.

  13. Online constrained model-based reinforcement learning

    CSIR Research Space (South Africa)

    Van Niekerk, B

    2017-08-01

    Full Text Available Using direct multiple shooting (Bock and Plitt, 1984), problem (1) can be transformed into a structured nonlinear program (NLP). First, the time horizon [t0, t0 + T] is partitioned into N equal subintervals [tk, tk+1] for k = 0...
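
    The fragment above describes the multiple-shooting transcription: the horizon is split into N subintervals, the dynamics are integrated on each one, and continuity between segments is enforced by the NLP. The sketch below illustrates that construction for an arbitrary toy system; the dynamics, horizon, and interval count are assumptions, not the system considered in the record.

        import numpy as np
        from scipy.integrate import solve_ivp

        def dynamics(t, x, u):
            # Illustrative controlled system (not the one in the paper): x' = [x2, u - x1].
            return [x[1], u - x[0]]

        def shooting_defects(node_states, controls, t0, T, N):
            # Direct multiple shooting: integrate each subinterval [t_k, t_{k+1}] from
            # its node state and return the continuity defects that the NLP solver
            # would drive to zero.
            ts = np.linspace(t0, t0 + T, N + 1)
            defects = []
            for k in range(N):
                sol = solve_ivp(dynamics, (ts[k], ts[k + 1]), node_states[k],
                                args=(controls[k],), rtol=1e-8)
                defects.append(sol.y[:, -1] - node_states[k + 1])
            return np.concatenate(defects)

        # Example: 4 subintervals, zero controls, and a rough guess for the node states.
        N = 4
        x_nodes = [np.array([1.0, 0.0]) for _ in range(N + 1)]
        u_nodes = np.zeros(N)
        print(shooting_defects(x_nodes, u_nodes, t0=0.0, T=2.0, N=N))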

  14. Reinforcement learning account of network reciprocity.

    Science.gov (United States)

    Ezaki, Takahiro; Masuda, Naoki

    2017-01-01

    Evolutionary game theory predicts that cooperation in social dilemma games is promoted when agents are connected as a network. However, when networks are fixed over time, humans do not necessarily show enhanced mutual cooperation. Here we show that reinforcement learning (specifically, the so-called Bush-Mosteller model) approximately explains the experimentally observed network reciprocity and the lack thereof in a parameter region spanned by the benefit-to-cost ratio and the node's degree. Thus, we significantly extend previously obtained numerical results.
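
    The record explains network reciprocity with the Bush-Mosteller model, in which the probability of cooperating moves toward or away from the last action depending on whether the payoff exceeded an aspiration level. The following sketch follows the usual textbook form of that update; the clipping of the stimulus to [-1, 1] and the learning rate are simplifying assumptions rather than the exact parameterization used in the paper.

        def bush_mosteller_update(p_cooperate, action, payoff, aspiration, alpha=0.2):
            # One Bush-Mosteller step. `action` is 1 for cooperate, 0 for defect.
            # A payoff above the aspiration level reinforces the action just taken;
            # a payoff below it pushes the probability the other way.
            s = max(-1.0, min(1.0, payoff - aspiration))   # clipped stimulus
            if action == 1:                                # cooperated last round
                if s >= 0:
                    return p_cooperate + (1 - p_cooperate) * alpha * s
                return p_cooperate + p_cooperate * alpha * s
            else:                                          # defected last round
                if s >= 0:
                    return p_cooperate - p_cooperate * alpha * s
                return p_cooperate - (1 - p_cooperate) * alpha * s

        # Example: a cooperator whose payoff exceeded aspiration cooperates more next round.
        p_next = bush_mosteller_update(0.5, action=1, payoff=3.0, aspiration=1.0)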

  15. Distributed Economic Dispatch in Microgrids Based on Cooperative Reinforcement Learning.

    Science.gov (United States)

    Liu, Weirong; Zhuang, Peng; Liang, Hao; Peng, Jun; Huang, Zhiwu

    2018-06-01

    Microgrids incorporated with distributed generation (DG) units and energy storage (ES) devices are expected to play more and more important roles in the future power systems. Yet, achieving efficient distributed economic dispatch in microgrids is a challenging issue due to the randomness and nonlinear characteristics of DG units and loads. This paper proposes a cooperative reinforcement learning algorithm for distributed economic dispatch in microgrids. Utilizing the learning algorithm can avoid the difficulty of stochastic modeling and high computational complexity. In the cooperative reinforcement learning algorithm, the function approximation is leveraged to deal with the large and continuous state spaces. And a diffusion strategy is incorporated to coordinate the actions of DG units and ES devices. Based on the proposed algorithm, each node in microgrids only needs to communicate with its local neighbors, without relying on any centralized controllers. Algorithm convergence is analyzed, and simulations based on real-world meteorological and load data are conducted to validate the performance of the proposed algorithm.

  16. A Neuro-Control Design Based on Fuzzy Reinforcement Learning

    DEFF Research Database (Denmark)

    Katebi, S.D.; Blanke, M.

    This paper describes a neuro-control fuzzy critic design procedure based on reinforcement learning. An important component of the proposed intelligent control configuration is the fuzzy credit assignment unit which acts as a critic, and through fuzzy implications provides adjustment mechanisms....... The fuzzy credit assignment unit comprises a fuzzy system with the appropriate fuzzification, knowledge base and defuzzification components. When an external reinforcement signal (a failure signal) is received, sequences of control actions are evaluated and modified by the action applier unit. The desirable...... ones instruct the neuro-control unit to adjust its weights and are simultaneously stored in the memory unit during the training phase. In response to the internal reinforcement signal (set point threshold deviation), the stored information is retrieved by the action applier unit and utilized for re...

  17. Emotion in reinforcement learning agents and robots : A survey

    NARCIS (Netherlands)

    Moerland, T.M.; Broekens, D.J.; Jonker, C.M.

    2018-01-01

    This article provides the first survey of computational models of emotion in reinforcement learning (RL) agents. The survey focuses on agent/robot emotions, and mostly ignores human user emotions. Emotions are recognized as functional in decision-making by influencing motivation and action

  18. Different levels of food restriction reveal genotype-specific differences in learning a visual discrimination task.

    Directory of Open Access Journals (Sweden)

    Kalina Makowiecki

    Full Text Available In behavioural experiments, motivation to learn can be achieved using food rewards as positive reinforcement in food-restricted animals. Previous studies reduce animal weights to 80-90% of free-feeding body weight as the criterion for food restriction. However, effects of different degrees of food restriction on task performance have not been assessed. We compared learning task performance in mice food-restricted to 80 or 90% body weight (BW). We used adult wildtype (WT; C57Bl/6j) and knockout (ephrin-A2⁻/⁻) mice, previously shown to have a reverse learning deficit. Mice were trained in a two-choice visual discrimination task with food reward as positive reinforcement. When mice reached criterion for one visual stimulus (80% correct in three consecutive 10-trial sets) they began the reverse learning phase, where the rewarded stimulus was switched to the previously incorrect stimulus. For the initial learning and reverse phase of the task, mice at 90%BW took almost twice as many trials to reach criterion as mice at 80%BW. Furthermore, WT 80 and 90%BW groups significantly differed in percentage correct responses and learning strategy in the reverse learning phase, whereas no differences between weight restriction groups were observed in ephrin-A2⁻/⁻ mice. Most importantly, genotype-specific differences in reverse learning strategy were only detected in the 80%BW groups. Our results indicate that increased food restriction not only results in better performance and a shorter training period, but may also be necessary for revealing behavioural differences between experimental groups. This has important ethical and animal welfare implications when deciding extent of diet restriction in behavioural studies.

  19. Neural mechanisms of reinforcement learning in unmedicated patients with major depressive disorder.

    Science.gov (United States)

    Rothkirch, Marcus; Tonn, Jonas; Köhler, Stephan; Sterzer, Philipp

    2017-04-01

    According to current concepts, major depressive disorder is strongly related to dysfunctional neural processing of motivational information, entailing impairments in reinforcement learning. While computational modelling can reveal the precise nature of neural learning signals, it has not been used to study learning-related neural dysfunctions in unmedicated patients with major depressive disorder so far. We thus aimed at comparing the neural coding of reward and punishment prediction errors, representing indicators of neural learning-related processes, between unmedicated patients with major depressive disorder and healthy participants. To this end, a group of unmedicated patients with major depressive disorder (n = 28) and a group of age- and sex-matched healthy control participants (n = 30) completed an instrumental learning task involving monetary gains and losses during functional magnetic resonance imaging. The two groups did not differ in their learning performance. Patients and control participants showed the same level of prediction error-related activity in the ventral striatum and the anterior insula. In contrast, neural coding of reward prediction errors in the medial orbitofrontal cortex was reduced in patients. Moreover, neural reward prediction error signals in the medial orbitofrontal cortex and ventral striatum showed negative correlations with anhedonia severity. Using a standard instrumental learning paradigm we found no evidence for an overall impairment of reinforcement learning in medication-free patients with major depressive disorder. Importantly, however, the attenuated neural coding of reward in the medial orbitofrontal cortex and the relation between anhedonia and reduced reward prediction error-signalling in the medial orbitofrontal cortex and ventral striatum likely reflect an impairment in experiencing pleasure from rewarding events as a key mechanism of anhedonia in major depressive disorder. © The Author (2017). Published by Oxford

  20. Reinforcement learning account of network reciprocity.

    Directory of Open Access Journals (Sweden)

    Takahiro Ezaki

    Full Text Available Evolutionary game theory predicts that cooperation in social dilemma games is promoted when agents are connected as a network. However, when networks are fixed over time, humans do not necessarily show enhanced mutual cooperation. Here we show that reinforcement learning (specifically, the so-called Bush-Mosteller model) approximately explains the experimentally observed network reciprocity and the lack thereof in a parameter region spanned by the benefit-to-cost ratio and the node's degree. Thus, we significantly extend previously obtained numerical results.

  1. A Reinforcement-Based Learning Paradigm Increases Anatomical Learning and Retention—A Neuroeducation Study

    Science.gov (United States)

    Anderson, Sarah J.; Hecker, Kent G.; Krigolson, Olave E.; Jamniczky, Heather A.

    2018-01-01

    In anatomy education, a key hurdle to engaging in higher-level discussion in the classroom is recognizing and understanding the extensive terminology used to identify and describe anatomical structures. Given the time-limited classroom environment, seeking methods to impart this foundational knowledge to students in an efficient manner is essential. Just-in-Time Teaching (JiTT) methods incorporate pre-class exercises (typically online) meant to establish foundational knowledge in novice learners so subsequent instructor-led sessions can focus on deeper, more complex concepts. Determining how best to design and assess pre-class exercises requires a detailed examination of learning and retention in an applied educational context. Here we used electroencephalography (EEG) as a quantitative dependent variable to track learning and examine the efficacy of JiTT activities to teach anatomy. Specifically, we examined changes in the amplitude of the N250 and reward positivity event-related brain potential (ERP) components alongside behavioral performance as novice students participated in a series of computerized reinforcement-based learning modules to teach neuroanatomical structures. We found that as students learned to identify anatomical structures, the amplitude of the N250 increased and reward positivity amplitude decreased in response to positive feedback. On both a retention and a transfer exercise, when learners successfully remembered and translated their knowledge to novel images, the amplitude of the reward positivity remained decreased compared to early learning. Our findings suggest ERPs can be used as a tool to track learning, retention, and transfer of knowledge and that employing the reinforcement learning paradigm is an effective educational approach for developing anatomical expertise. PMID:29467638

  2. A Reinforcement-Based Learning Paradigm Increases Anatomical Learning and Retention—A Neuroeducation Study

    Directory of Open Access Journals (Sweden)

    Sarah J. Anderson

    2018-02-01

    Full Text Available In anatomy education, a key hurdle to engaging in higher-level discussion in the classroom is recognizing and understanding the extensive terminology used to identify and describe anatomical structures. Given the time-limited classroom environment, seeking methods to impart this foundational knowledge to students in an efficient manner is essential. Just-in-Time Teaching (JiTT) methods incorporate pre-class exercises (typically online) meant to establish foundational knowledge in novice learners so subsequent instructor-led sessions can focus on deeper, more complex concepts. Determining how best to design and assess pre-class exercises requires a detailed examination of learning and retention in an applied educational context. Here we used electroencephalography (EEG) as a quantitative dependent variable to track learning and examine the efficacy of JiTT activities to teach anatomy. Specifically, we examined changes in the amplitude of the N250 and reward positivity event-related brain potential (ERP) components alongside behavioral performance as novice students participated in a series of computerized reinforcement-based learning modules to teach neuroanatomical structures. We found that as students learned to identify anatomical structures, the amplitude of the N250 increased and reward positivity amplitude decreased in response to positive feedback. On both a retention and a transfer exercise, when learners successfully remembered and translated their knowledge to novel images, the amplitude of the reward positivity remained decreased compared to early learning. Our findings suggest ERPs can be used as a tool to track learning, retention, and transfer of knowledge and that employing the reinforcement learning paradigm is an effective educational approach for developing anatomical expertise.

  3. Optimal Control via Reinforcement Learning with Symbolic Policy Approximation

    NARCIS (Netherlands)

    Kubalìk, Jiřì; Alibekov, Eduard; Babuska, R.; Dochain, Denis; Henrion, Didier; Peaucelle, Dimitri

    2017-01-01

    Model-based reinforcement learning (RL) algorithms can be used to derive optimal control laws for nonlinear dynamic systems. With continuous-valued state and input variables, RL algorithms have to rely on function approximators to represent the value function and policy mappings. This paper

  4. The chronotron: a neuron that learns to fire temporally precise spike patterns.

    Directory of Open Access Journals (Sweden)

    Răzvan V Florian

    Full Text Available In many cases, neurons process information carried by the precise timings of spikes. Here we show how neurons can learn to generate specific temporally precise output spikes in response to input patterns of spikes having precise timings, thus processing and memorizing information that is entirely temporally coded, both as input and as output. We introduce two new supervised learning rules for spiking neurons with temporal coding of information (chronotrons), one that provides high memory capacity (E-learning), and one that has a higher biological plausibility (I-learning). With I-learning, the neuron learns to fire the target spike trains through synaptic changes that are proportional to the synaptic currents at the timings of real and target output spikes. We study these learning rules in computer simulations where we train integrate-and-fire neurons. Both learning rules allow neurons to fire at the desired timings, with sub-millisecond precision. We show how chronotrons can learn to classify their inputs, by firing identical, temporally precise spike trains for different inputs belonging to the same class. When the input is noisy, the classification also leads to noise reduction. We compute lower bounds for the memory capacity of chronotrons and explore the influence of various parameters on chronotrons' performance. The chronotrons can model neurons that encode information in the time of the first spike relative to the onset of salient stimuli or neurons in oscillatory networks that encode information in the phases of spikes relative to the background oscillation. Our results show that firing one spike per cycle optimizes memory capacity in neurons encoding information in the phase of firing relative to a background rhythm.

  5. Stochastic abstract policies: generalizing knowledge to improve reinforcement learning.

    Science.gov (United States)

    Koga, Marcelo L; Freire, Valdinei; Costa, Anna H R

    2015-01-01

    Reinforcement learning (RL) enables an agent to learn behavior by acquiring experience through trial-and-error interactions with a dynamic environment. However, knowledge is usually built from scratch and learning to behave may take a long time. Here, we improve the learning performance by leveraging prior knowledge; that is, the learner shows proper behavior from the beginning of a target task, using the knowledge from a set of known, previously solved, source tasks. In this paper, we argue that building stochastic abstract policies that generalize over past experiences is an effective way to provide such improvement, and that this generalization outperforms the current practice of using a library of policies. We achieve this by contributing a new algorithm, AbsProb-PI-multiple, and a framework for transferring knowledge represented as a stochastic abstract policy to new RL tasks. Stochastic abstract policies offer an effective way to encode knowledge because the abstraction they provide not only generalizes solutions but also facilitates extracting the similarities among tasks. We perform experiments in a robotic navigation environment and analyze the agent's behavior throughout the learning process and also assess the transfer ratio for different amounts of source tasks. We compare our method with the transfer of a library of policies, and experiments show that the use of a generalized policy produces better results by more effectively guiding the agent when learning a target task.

  6. Network Supervision of Adult Experience and Learning Dependent Sensory Cortical Plasticity.

    Science.gov (United States)

    Blake, David T

    2017-06-18

    The brain is capable of remodeling throughout life. The sensory cortices provide a useful preparation for studying neuroplasticity both during development and thereafter. In adulthood, sensory cortices change in the cortical area activated by behaviorally relevant stimuli, by the strength of response within that activated area, and by the temporal profiles of those responses. Evidence supports forms of unsupervised, reinforcement, and fully supervised network learning rules. Studies on experience-dependent plasticity have mostly not controlled for learning, and they find support for unsupervised learning mechanisms. Changes occur with greatest ease in neurons containing α-CamKII, which are pyramidal neurons in layers II/III and layers V/VI. These changes use synaptic mechanisms including long term depression. Synaptic strengthening at NMDA-containing synapses does occur, but its weak association with activity suggests other factors also initiate changes. Studies that control learning find support of reinforcement learning rules and limited evidence of other forms of supervised learning. Behaviorally associating a stimulus with reinforcement leads to a strengthening of cortical response strength and enlarging of response area with poor selectivity. Associating a stimulus with omission of reinforcement leads to a selective weakening of responses. In some preparations in which these associations are not as clearly made, neurons with the most informative discharges are relatively stronger after training. Studies analyzing the temporal profile of responses associated with omission of reward, or of plasticity in studies with different discriminanda but statistically matched stimuli, support the existence of limited supervised network learning. © 2017 American Physiological Society. Compr Physiol 7:977-1008, 2017. Copyright © 2017 John Wiley & Sons, Inc.

  7. Optimal control in microgrid using multi-agent reinforcement learning.

    Science.gov (United States)

    Li, Fu-Dong; Wu, Min; He, Yong; Chen, Xin

    2012-11-01

    This paper presents an improved reinforcement learning method to minimize electricity costs while satisfying the power balance and generation limits of units in a microgrid operating in grid-connected mode. First, the microgrid control requirements are analyzed and the objective function of optimal control for the microgrid is formulated. Then, a state variable, "Average Electricity Price Trend", which expresses the most probable transitions of the system, is introduced to reduce the complexity and randomness of the microgrid, and a multi-agent architecture including agents, state variables, action variables and a reward function is formulated. Furthermore, dynamic hierarchical reinforcement learning, based on the rate of change of a key state variable, is established to carry out optimal policy exploration. The analysis shows that the proposed method helps to handle the "curse of dimensionality" and speeds up learning in unknown large-scale environments. Finally, simulation results under JADE (Java Agent Development Framework) demonstrate the validity of the presented method for optimal control of a microgrid in grid-connected mode. Copyright © 2012 ISA. Published by Elsevier Ltd. All rights reserved.

  8. Different mechanisms in learning different second languages: Evidence from English speakers learning Chinese and Spanish.

    Science.gov (United States)

    Cao, Fan; Sussman, Bethany L; Rios, Valeria; Yan, Xin; Wang, Zhao; Spray, Gregory J; Mack, Ryan M

    2017-03-01

    Word reading has been found to be associated with different neural networks in different languages, with greater involvement of the lexical pathway for opaque languages and greater involvement of the sub-lexical pathway for transparent languages. However, we do not know whether this language divergence can be demonstrated in second language learners, how a learner's metalinguistic ability modulates the language divergence, or whether the learning method interacts with the language divergence. In this study, we attempted to answer these questions by comparing brain activations during Chinese and Spanish word reading in native English-speaking adults who learned Chinese and Spanish over a 2-week period under three learning conditions: phonological, handwriting, and passive viewing. We found that mapping orthography to phonology in Chinese produced greater activation in the left inferior frontal gyrus (IFG) and left inferior temporal gyrus (ITG) than in Spanish, suggesting greater involvement of the lexical pathway in opaque languages. In contrast, Spanish words evoked greater activation in the left superior temporal gyrus (STG) than Chinese words, suggesting greater involvement of the sublexical pathway for transparent languages. Furthermore, brain-behavior correlation analyses found that higher phonological awareness and rapid naming were associated with greater activation in the bilateral IFG for Chinese and in the bilateral STG for Spanish, suggesting greater language divergence in participants with higher metalinguistic awareness. Finally, a significant interaction between language and learning condition was found in the left STG and middle frontal gyrus (MFG), with greater activation for handwriting learning than viewing learning in the left STG only for Spanish, and greater activation for handwriting learning than phonological learning in the left MFG only for Chinese. These findings suggest that handwriting facilitates assembled phonology in Spanish and addressed

  9. Differential Spatio-temporal Multiband Satellite Image Clustering using K-means Optimization With Reinforcement Programming

    Directory of Open Access Journals (Sweden)

    Irene Erlyn Wina Rachmawan

    2015-06-01

    Full Text Available Deforestation is one of the crucial issues in Indonesia, which now has the world's highest deforestation rate. On the other hand, multispectral imagery delivers a rich source of data for studying the spatial and temporal variability of the environment, such as deforested areas. This research presents differential image processing methods for detecting deforestation-related change. Our differential image processing algorithms extract and indicate the affected area automatically: they derive information from multiband satellite images and calculate the deforested area per year from a temporal dataset. However, multiband satellite images are large, which makes segmentation difficult to handle. K-means clustering is commonly considered a powerful clustering algorithm because of its ability to cluster big data; however, K-means is sensitive to its initial centroids, which can lead to poor performance. In this paper we propose a new approach to optimize K-means clustering using Reinforcement Programming in order to cluster multispectral images. We build a new mechanism for generating initial centroids by applying the exploration and exploitation knowledge of Reinforcement Programming; this optimization leads to a better K-means clustering result. We selected multispectral Landsat 7 images from the past ten years in Medawai, Borneo, Indonesia, and segmented two classes: deforested land and forest. We conducted a series of experiments and compared K-means using Reinforcement Programming for initial-centroid optimization against standard K-means without the optimization process. Keywords: Deforestation, Multispectral images, Landsat, automatic clustering, K-means.
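
    The following is a minimal, hypothetical sketch of the general idea of choosing K-means initial centroids with an exploration/exploitation rule instead of purely at random; it is not the paper's Reinforcement Programming procedure, and the reward used here (distance from already-chosen centroids) is an illustrative assumption.

        # Hypothetical sketch: exploration/exploitation choice of initial centroids for K-means.
        import numpy as np

        def init_centroids(pixels, k, eps=0.2, rng=np.random.default_rng(0)):
            """pixels: (n, bands) array of multispectral samples."""
            centroids = [pixels[rng.integers(len(pixels))]]          # start from a random pixel
            while len(centroids) < k:
                dists = np.min([np.linalg.norm(pixels - c, axis=1) for c in centroids], axis=0)
                if rng.random() < eps:                                # explore: random candidate
                    idx = rng.integers(len(pixels))
                else:                                                 # exploit: farthest candidate
                    idx = int(np.argmax(dists))
                centroids.append(pixels[idx])
            return np.stack(centroids)

        # Usage on synthetic 6-band samples; the result is handed to any standard K-means routine.
        data = np.random.default_rng(1).random((500, 6))
        print(init_centroids(data, k=2).shape)                        # (2, 6)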

  10. Integrating distributed Bayesian inference and reinforcement learning for sensor management

    NARCIS (Netherlands)

    Grappiolo, C.; Whiteson, S.; Pavlin, G.; Bakker, B.

    2009-01-01

    This paper introduces a sensor management approach that integrates distributed Bayesian inference (DBI) and reinforcement learning (RL). DBI is implemented using distributed perception networks (DPNs), a multiagent approach to performing efficient inference, while RL is used to automatically

  11. Learning User Preferences in Ubiquitous Systems: A User Study and a Reinforcement Learning Approach

    OpenAIRE

    Zaidenberg , Sofia; Reignier , Patrick; Mandran , Nadine

    2010-01-01

    International audience; Our study concerns a virtual assistant that proposes services to the user based on the user's current perceived activity and situation (ambient intelligence). Instead of asking the user to define his preferences, we acquire them automatically using a reinforcement learning approach. Experiments showed that our system succeeded in learning user preferences. In order to validate the relevance and usability of such a system, we first conducted a user study. 26 non-expert s...

  12. Teaching Self-Control with Qualitatively Different Reinforcers

    Science.gov (United States)

    Passage, Michael; Tincani, Matt; Hantula, Donald A.

    2012-01-01

    This study examined the effectiveness of using qualitatively different reinforcers to teach self-control to an adolescent boy who had been diagnosed with an intellectual disability. First, he was instructed to engage in an activity without programmed reinforcement. Next, he was instructed to engage in the activity under a two-choice fixed-duration…

  13. Multiagent cooperation and competition with deep reinforcement learning.

    Directory of Open Access Journals (Sweden)

    Ardi Tampuu

    Full Text Available Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.
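
    The following is a minimal sketch of the kind of reward-scheme manipulation described above: the player that loses the ball is always penalized, while the scoring player's reward rho is swept from +1 (fully competitive) towards -1 (both players want long rallies, i.e. cooperation). The exact parameterization is an assumption for illustration; each agent would run its own deep Q-network update on its private reward stream.

        # Hypothetical sketch of a tunable two-player Pong reward scheme.
        def pong_rewards(scorer, rho):
            """Return per-player rewards when `scorer` wins the point; rho in [-1, 1]."""
            loser = "right" if scorer == "left" else "left"
            return {scorer: rho, loser: -1.0}

        print(pong_rewards("left", rho=1.0))    # competitive: {'left': 1.0, 'right': -1.0}
        print(pong_rewards("left", rho=-1.0))   # cooperative: {'left': -1.0, 'right': -1.0}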

  14. Multiagent cooperation and competition with deep reinforcement learning

    Science.gov (United States)

    Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments. PMID:28380078

  15. Multiagent cooperation and competition with deep reinforcement learning.

    Science.gov (United States)

    Tampuu, Ardi; Matiisen, Tambet; Kodelja, Dorian; Kuzovkin, Ilya; Korjus, Kristjan; Aru, Juhan; Aru, Jaan; Vicente, Raul

    2017-01-01

    Evolution of cooperation and competition can appear when multiple adaptive agents share a biological, social, or technological niche. In the present work we study how cooperation and competition emerge between autonomous agents that learn by reinforcement while using only their raw visual input as the state representation. In particular, we extend the Deep Q-Learning framework to multiagent environments to investigate the interaction between two learning agents in the well-known video game Pong. By manipulating the classical rewarding scheme of Pong we show how competitive and collaborative behaviors emerge. We also describe the progression from competitive to collaborative behavior when the incentive to cooperate is increased. Finally we show how learning by playing against another adaptive agent, instead of against a hard-wired algorithm, results in more robust strategies. The present work shows that Deep Q-Networks can become a useful tool for studying decentralized learning of multiagent systems coping with high-dimensional environments.

  16. Extinction of Pavlovian conditioning: The influence of trial number and reinforcement history.

    Science.gov (United States)

    Chan, C K J; Harris, Justin A

    2017-08-01

    Pavlovian conditioning is sensitive to the temporal relationship between the conditioned stimulus (CS) and the unconditioned stimulus (US). This has motivated models that describe learning as a process that continuously updates associative strength during the trial or specifically encodes the CS-US interval. These models predict that extinction of responding is also continuous, such that response loss is proportional to the cumulative duration of exposure to the CS without the US. We review evidence showing that this prediction is incorrect, and that extinction is trial-based rather than time-based. We also present two experiments that test the importance of trials versus time on the Partial Reinforcement Extinction Effect (PREE), in which responding extinguishes more slowly for a CS that was inconsistently reinforced with the US than for a consistently reinforced one. We show that increasing the number of extinction trials of the partially reinforced CS, relative to the consistently reinforced CS, overcomes the PREE. However, increasing the duration of extinction trials by the same amount does not overcome the PREE. We conclude that animals learn about the likelihood of the US per trial during conditioning, and learn trial-by-trial about the absence of the US during extinction. Moreover, what they learn about the likelihood of the US during conditioning affects how sensitive they are to the absence of the US during extinction. Copyright © 2017 Elsevier B.V. All rights reserved.

  17. Team learning : New insights through a temporal lens

    NARCIS (Netherlands)

    Lehmann-Willenbrock, N.

    2017-01-01

    Team learning is a complex social phenomenon that develops and changes over time. Hence, to promote understanding of the fine-grained dynamics of team learning, research should account for the temporal patterns of team learning behavior. Taking important steps in this direction, this special issue

  18. Optimal and Autonomous Control Using Reinforcement Learning: A Survey.

    Science.gov (United States)

    Kiumarsi, Bahare; Vamvoudakis, Kyriakos G; Modares, Hamidreza; Lewis, Frank L

    2018-06-01

    This paper reviews the current state of the art on reinforcement learning (RL)-based feedback control solutions to optimal regulation and tracking of single and multiagent systems. Existing RL solutions to optimal regulation and tracking control problems, as well as graphical games, will be reviewed. RL methods learn the solution to optimal control and game problems online, using measured data along the system trajectories. We discuss Q-learning and the integral RL algorithm as core algorithms for discrete-time (DT) and continuous-time (CT) systems, respectively. Moreover, we discuss a new direction of off-policy RL for both CT and DT systems. Finally, we review several applications.

  19. Multiagent Reinforcement Learning with Regret Matching for Robot Soccer

    Directory of Open Access Journals (Sweden)

    Qiang Liu

    2013-01-01

    Full Text Available This paper proposes a novel multiagent reinforcement learning (MARL) algorithm, Nash-Q learning with regret matching, in which regret matching is used to speed up the well-known MARL algorithm Nash-Q learning. Choosing a suitable action-selection strategy that balances exploration and exploitation is critical for enhancing the online learning ability of Nash-Q learning. In a Markov game, the joint action of agents adopting the regret matching algorithm converges to a set of no-regret points, which can be viewed as a coarse correlated equilibrium that includes Nash equilibrium in essence. It can therefore be inferred that regret matching can guide exploration of the state-action space so that the convergence rate of the Nash-Q learning algorithm is increased. Simulation results on robot soccer validate that, compared to the original Nash-Q learning algorithm, the use of regret matching during the learning phase of Nash-Q learning yields excellent online learning ability and results in significantly better performance in terms of scores, average reward, and policy convergence.
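
    The following is a minimal sketch of plain regret matching, the action-selection rule discussed above: an action is chosen with probability proportional to its positive cumulative regret, i.e. how much better it would have done than the actions actually played. The class and the soccer-flavored actions are illustrative assumptions; the full Nash-Q integration is not shown.

        # Hypothetical sketch of regret-matching action selection.
        import random

        class RegretMatcher:
            def __init__(self, actions):
                self.actions = actions
                self.regret = {a: 0.0 for a in actions}

            def choose(self):
                positive = {a: max(r, 0.0) for a, r in self.regret.items()}
                if sum(positive.values()) == 0:            # no positive regret yet: explore uniformly
                    return random.choice(self.actions)
                weights = [positive[a] for a in self.actions]
                return random.choices(self.actions, weights=weights, k=1)[0]

            def update(self, played, counterfactual_payoffs):
                """counterfactual_payoffs[a]: payoff action a would have earned this round."""
                for a in self.actions:
                    self.regret[a] += counterfactual_payoffs[a] - counterfactual_payoffs[played]

        # Usage: after each round, feed back what every action would have paid.
        rm = RegretMatcher(["shoot", "pass", "dribble"])
        rm.update("pass", {"shoot": 1.0, "pass": 0.2, "dribble": 0.0})
        print(rm.choose())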

  20. Multi-Objective Reinforcement Learning-Based Deep Neural Networks for Cognitive Space Communications

    Science.gov (United States)

    Ferreria, Paulo Victor R.; Paffenroth, Randy; Wyglinski, Alexander M.; Hackett, Timothy M.; Bilen, Sven G.; Reinhart, Richard C.; Mortensen, Dale J.

    2017-01-01

    Future communication subsystems of space exploration missions can potentially benefit from software-defined radios (SDRs) controlled by machine learning algorithms. In this paper, we propose a novel hybrid radio resource allocation management control algorithm that integrates multi-objective reinforcement learning and deep artificial neural networks. The objective is to efficiently manage communications system resources by monitoring performance functions with common dependent variables that result in conflicting goals. The uncertainty in the performance of thousands of different possible combinations of radio parameters makes the trade-off between exploration and exploitation in reinforcement learning (RL) much more challenging for future critical space-based missions. Thus, the system should spend as little time as possible on exploring actions, and whenever it explores an action, it should perform at acceptable levels most of the time. The proposed approach enables on-line learning by interactions with the environment and restricts poor resource allocation performance through virtual environment exploration. Improvements in the multiobjective performance can be achieved via transmitter parameter adaptation on a packet-basis, with poorly predicted performance promptly resulting in rejected decisions. Simulations presented in this work considered the DVB-S2 standard adaptive transmitter parameters and additional ones expected to be present in future adaptive radio systems. Performance results are provided by analysis of the proposed hybrid algorithm when operating across a satellite communication channel from Earth to GEO orbit during clear sky conditions. The proposed approach constitutes part of the core cognitive engine proof-of-concept to be delivered to the NASA Glenn Research Center SCaN Testbed located onboard the International Space Station.

  1. Multiagent Reinforcement Learning Dynamic Spectrum Access in Cognitive Radios

    Directory of Open Access Journals (Sweden)

    Wu Chun

    2014-02-01

    Full Text Available A multiuser independent Q-learning method that does not require information exchange is proposed for multiuser dynamic spectrum access in cognitive radios. The method adopts a self-learning paradigm, in which each CR user performs reinforcement learning only by observing its individual performance reward, without spending communication resources on information exchange with others. The reward is defined to represent channel quality and channel-conflict status. A learning strategy of sufficient exploration, preference for good channels, and punishment for channel conflicts is designed to implement multiuser dynamic spectrum access. In a two-user, two-channel scenario, a fast learning algorithm is proposed and its convergence to the maximal total reward is proved. The simulation results show that, with the proposed method, the CR system converges to a Nash equilibrium with high probability and achieves a high total reward.
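
    The following is a minimal sketch of the independent-learning idea described above: each cognitive-radio user keeps its own (stateless, bandit-style) Q-values over channels and learns only from its own reward, defined here, as an illustrative assumption, as channel quality minus a penalty when another user picked the same channel. It is not the paper's exact algorithm.

        # Hypothetical sketch of multiuser independent Q-learning for channel selection.
        import random

        def reward(quality, collided, penalty=1.0):
            return -penalty if collided else quality           # punish conflicts, reward quality

        def choose_channel(q, n_channels, eps):
            if random.random() < eps:                          # sufficient exploration
                return random.randrange(n_channels)
            return max(range(n_channels), key=lambda c: q[c])  # prefer good channels

        def q_update(q, channel, r, alpha=0.1):
            q[channel] += alpha * (r - q[channel])

        # Usage: two users, two channels, no information exchange between users.
        q_users = [[0.0, 0.0], [0.0, 0.0]]
        for step in range(200):
            picks = [choose_channel(q, 2, eps=0.1) for q in q_users]
            for u, q in enumerate(q_users):
                collided = picks.count(picks[u]) > 1
                r = reward(quality=0.8 if picks[u] == 0 else 0.5, collided=collided)
                q_update(q, picks[u], r)
        print(q_users)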

  2. Learning of Temporal and Spatial Movement Aspects: A Comparison of Four Types of Haptic Control and Concurrent Visual Feedback.

    Science.gov (United States)

    Rauter, Georg; Sigrist, Roland; Riener, Robert; Wolf, Peter

    2015-01-01

    In the literature, the effectiveness of haptics for motor learning is controversially discussed. Haptics is believed to be effective for motor learning in general; however, different types of haptic control enhance different movement aspects. Thus, depending on the movement aspects of interest, one type of haptic control may be effective whereas another is not. Therefore, in the current work, we investigated whether and how different types of haptic controllers affect learning of spatial and temporal movement aspects. In particular, haptic controllers that enforce active participation of the participants were expected to improve spatial aspects. Only haptic controllers that provide feedback about the task's velocity profile were expected to improve temporal aspects. In a study on learning a complex trunk-arm rowing task, the effect of training with four different types of haptic control was investigated: position control, path control, adaptive path control, and reactive path control. A fifth group (control) trained with visual concurrent augmented feedback. As hypothesized, the position controller was most effective for learning temporal movement aspects, while the path controller was most effective in teaching spatial movement aspects of the rowing task. Visual feedback was also effective for learning temporal and spatial movement aspects.

  3. DYNAMIC AND INCREMENTAL EXPLORATION STRATEGY IN FUSION ADAPTIVE RESONANCE THEORY FOR ONLINE REINFORCEMENT LEARNING

    Directory of Open Access Journals (Sweden)

    Budhitama Subagdja

    2016-06-01

    Full Text Available One of the fundamental challenges in reinforcement learning is to set up a proper balance between exploration and exploitation to obtain the maximum cumulative reward in the long run. Most exploration protocols bound the overall values to a convergent level of performance. If new knowledge is inserted or the environment suddenly changes, the issue becomes more intricate, as the exploration must compromise with the pre-existing knowledge. This paper presents a type of multi-channel adaptive resonance theory (ART) neural network model called fusion ART, which serves as a fuzzy approximator for reinforcement learning with inherent features that can regulate the exploration strategy. This intrinsic regulation is driven by the condition of the knowledge learnt so far by the agent. The model offers stable but incremental reinforcement learning that can involve prior rules as bootstrap knowledge for guiding the agent to select the right action. Experiments in obstacle avoidance and navigation tasks demonstrate that, when the agent learns from scratch, the inherent exploration model in fusion ART is comparable to the basic ε-greedy policy. On the other hand, the model is demonstrated to deal with prior knowledge and strike a balance between exploration and exploitation.

  4. Separation of time-based and trial-based accounts of the partial reinforcement extinction effect.

    Science.gov (United States)

    Bouton, Mark E; Woods, Amanda M; Todd, Travis P

    2014-01-01

    Two appetitive conditioning experiments with rats examined time-based and trial-based accounts of the partial reinforcement extinction effect (PREE). In the PREE, the loss of responding that occurs in extinction is slower when the conditioned stimulus (CS) has been paired with a reinforcer on some of its presentations (partially reinforced) instead of every presentation (continuously reinforced). According to a time-based or "time-accumulation" view (e.g., Gallistel and Gibbon, 2000), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger amount of time has accumulated in the CS over trials. In contrast, according to a trial-based view (e.g., Capaldi, 1967), the PREE occurs because the organism has learned in partial reinforcement to expect the reinforcer after a larger number of CS presentations. Experiment 1 used a procedure that equated partially and continuously reinforced groups on their expected times to reinforcement during conditioning. A PREE was still observed. Experiment 2 then used an extinction procedure that allowed time in the CS and the number of trials to accumulate differentially through extinction. The PREE was still evident when responding was examined as a function of expected time units to the reinforcer, but was eliminated when responding was examined as a function of expected trial units to the reinforcer. There was no evidence that the animal responded according to the ratio of time accumulated during the CS in extinction over the time in the CS expected before the reinforcer. The results thus favor a trial-based account over a time-based account of extinction and the PREE. This article is part of a Special Issue entitled: Associative and Temporal Learning. Copyright © 2013 Elsevier B.V. All rights reserved.

  5. Dopamine-Dependent Reinforcement of Motor Skill Learning: Evidence from Gilles de la Tourette Syndrome

    Science.gov (United States)

    Palminteri, Stefano; Lebreton, Mael; Worbe, Yulia; Hartmann, Andreas; Lehericy, Stephane; Vidailhet, Marie; Grabli, David; Pessiglione, Mathias

    2011-01-01

    Reinforcement learning theory has been extensively used to understand the neural underpinnings of instrumental behaviour. A central assumption surrounds dopamine signalling reward prediction errors, so as to update action values and ensure better choices in the future. However, educators may share the intuitive idea that reinforcements not only…

  6. Joy, Distress, Hope, and Fear in Reinforcement Learning (Extended Abstract)

    NARCIS (Netherlands)

    Jacobs, E.J.; Broekens, J.; Jonker, C.M.

    2014-01-01

    In this paper we present a mapping between joy, distress, hope and fear, and Reinforcement Learning primitives. Joy / distress is a signal that is derived from the RL update signal, while hope/fear is derived from the utility of the current state. Agent-based simulation experiments replicate
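
    The following is a minimal sketch of the stated mapping: joy/distress is read off the sign and size of the RL update signal (the temporal-difference error), and hope/fear off the utility (value) of the current state. The linear scaling is an illustrative assumption.

        # Hypothetical sketch of deriving emotion signals from RL primitives.
        def emotions(td_error, state_value):
            return {
                "joy":      max(td_error, 0.0),      # things went better than expected
                "distress": max(-td_error, 0.0),     # things went worse than expected
                "hope":     max(state_value, 0.0),   # the current state promises reward
                "fear":     max(-state_value, 0.0),  # the current state promises punishment
            }

        print(emotions(td_error=0.4, state_value=-0.2))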

  7. Influence of temporal resolution and processing of exposure data on modeling of chloride ingress and reinforcement corrosion in concrete

    DEFF Research Database (Denmark)

    Flint, Madeleine; Michel, Alexander; Billington, Sarah L.

    2014-01-01

    a numerical heat and mass transport model that includes full coupling of heat, moisture and ion transport. Heat, moisture, and chloride concentration distributions were passed to a simplified reinforcement corrosion initiation and propagation model. The numerical study indicates that processing and temporal resolution of the exposure data have a considerable impact on long-term hygrothermal distribution, chloride ingress, and reinforcement section loss results. Use of time-averaged exposure data in the heat and mass transport model reduces the rate of chloride ingress in concrete and affects prediction...

  8. Gaze-contingent reinforcement learning reveals incentive value of social signals in young children and adults.

    Science.gov (United States)

    Vernetti, Angélina; Smith, Tim J; Senju, Atsushi

    2017-03-15

    While numerous studies have demonstrated that infants and adults preferentially orient to social stimuli, it remains unclear as to what drives such preferential orienting. It has been suggested that the learned association between social cues and subsequent reward delivery might shape such social orienting. Using a novel, spontaneous indication of reinforcement learning (with the use of a gaze contingent reward-learning task), we investigated whether children and adults' orienting towards social and non-social visual cues can be elicited by the association between participants' visual attention and a rewarding outcome. Critically, we assessed whether the engaging nature of the social cues influences the process of reinforcement learning. Both children and adults learned to orient more often to the visual cues associated with reward delivery, demonstrating that cue-reward association reinforced visual orienting. More importantly, when the reward-predictive cue was social and engaging, both children and adults learned the cue-reward association faster and more efficiently than when the reward-predictive cue was social but non-engaging. These new findings indicate that social engaging cues have a positive incentive value. This could possibly be because they usually coincide with positive outcomes in real life, which could partly drive the development of social orienting. © 2017 The Authors.

  9. Adaptive Load Balancing of Parallel Applications with Multi-Agent Reinforcement Learning on Heterogeneous Systems

    Directory of Open Access Journals (Sweden)

    Johan Parent

    2004-01-01

    Full Text Available We report on the improvements that can be achieved by applying machine learning techniques, in particular reinforcement learning, for the dynamic load balancing of parallel applications. The applications considered in this paper are coarse-grained, data-intensive applications. Such applications put high pressure on the interconnect of the hardware. Synchronization and load balancing in complex, heterogeneous networks need fast, flexible, adaptive load-balancing algorithms. Viewing a parallel application as a one-state coordination game in the framework of multi-agent reinforcement learning, and using a recently introduced multi-agent exploration technique, we are able to improve upon the classic job-farming approach. The improvements are achieved with limited computation and communication overhead.

  10. Grounding the meanings in sensorimotor behavior using reinforcement learning

    Directory of Open Access Journals (Sweden)

    Igor Farkaš

    2012-02-01

    Full Text Available The recent outburst of interest in cognitive developmental robotics is fueled by the ambition to propose ecologically plausible mechanisms of how, among other things, a learning agent/robot could ground linguistic meanings in its sensorimotor behaviour. Along this stream, we propose a model that allows the simulated iCub robot to learn the meanings of actions (point, touch, and push) oriented towards objects in the robot's peripersonal space. In our experiments, the iCub learns to execute motor actions and comment on them. Architecturally, the model is composed of three neural-network-based modules that are trained in different ways. The first module, a two-layer perceptron, is trained by back-propagation to attend to the target position in the visual scene, given the low-level visual information and the feature-based target information. The second module, having the form of an actor-critic architecture, is the most distinguishing part of our model, and is trained by a continuous version of reinforcement learning to execute actions as sequences, based on a linguistic command. The third module, an echo-state network, is trained to provide the linguistic description of the executed actions. The trained model generalises well in the case of novel action-target combinations with randomised initial arm positions. It can also promptly adapt its behavior if the action/target suddenly changes during motor execution.

  11. Span: spike pattern association neuron for learning spatio-temporal spike patterns.

    Science.gov (United States)

    Mohemmed, Ammar; Schliebs, Stefan; Matsuda, Satoshi; Kasabov, Nikola

    2012-08-01

    Spiking Neural Networks (SNN) were shown to be suitable tools for the processing of spatio-temporal information. However, due to their inherent complexity, the formulation of efficient supervised learning algorithms for SNN is difficult and remains an important problem in the research area. This article presents SPAN - a spiking neuron that is able to learn associations of arbitrary spike trains in a supervised fashion allowing the processing of spatio-temporal information encoded in the precise timing of spikes. The idea of the proposed algorithm is to transform spike trains during the learning phase into analog signals so that common mathematical operations can be performed on them. Using this conversion, it is possible to apply the well-known Widrow-Hoff rule directly to the transformed spike trains in order to adjust the synaptic weights and to achieve a desired input/output spike behavior of the neuron. In the presented experimental analysis, the proposed learning algorithm is evaluated regarding its learning capabilities, its memory capacity, its robustness to noisy stimuli and its classification performance. Differences and similarities of SPAN regarding two related algorithms, ReSuMe and Chronotron, are discussed.
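
    The following is a minimal sketch of the SPAN idea described above: spike trains are convolved with a kernel so they become analog traces, and the Widrow-Hoff (delta) rule is applied to those traces to move the neuron's output spikes toward the desired ones. The alpha-shaped kernel and the learning rate are common choices used here as assumptions, not the paper's exact settings.

        # Hypothetical sketch of a SPAN-style weight update on kernel-convolved spike trains.
        import numpy as np

        def alpha_kernel(tau=5.0, length=50):
            t = np.arange(length)
            return (t / tau) * np.exp(1 - t / tau)

        def trace(spike_train, kernel):
            """Convolve a 0/1 spike train into a smooth analog trace."""
            return np.convolve(spike_train, kernel)[: len(spike_train)]

        def span_delta_w(input_spikes, desired_spikes, actual_spikes, lr=0.01):
            k = alpha_kernel()
            x, yd, ya = trace(input_spikes, k), trace(desired_spikes, k), trace(actual_spikes, k)
            return lr * np.sum(x * (yd - ya))      # Widrow-Hoff rule applied to the traces

        # Usage: one synapse; the desired output spike comes earlier than the actual one.
        T = 200
        inp, des, act = np.zeros(T), np.zeros(T), np.zeros(T)
        inp[[20, 60, 120]] = 1.0
        des[80] = 1.0
        act[110] = 1.0
        print(span_delta_w(inp, des, act))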

  12. Nonparametric bayesian reward segmentation for skill discovery using inverse reinforcement learning

    CSIR Research Space (South Africa)

    Ranchod, P

    2015-10-01

    Full Text Available We present a method for segmenting a set of unstructured demonstration trajectories to discover reusable skills using inverse reinforcement learning (IRL). Each skill is characterised by a latent reward function which the demonstrator is assumed...

  13. Energy Management Strategy for a Hybrid Electric Vehicle Based on Deep Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Yue Hu

    2018-01-01

    Full Text Available An energy management strategy (EMS) is important for hybrid electric vehicles (HEVs), since it plays a decisive role in the performance of the vehicle. However, the variation of future driving conditions deeply influences the effectiveness of the EMS. Most existing EMS methods simply follow predefined rules that are not adaptive to different driving conditions online. Therefore, it is useful for the EMS to learn from the environment or driving cycle. In this paper, a deep reinforcement learning (DRL)-based EMS is designed such that it can learn to select actions directly from the states without any prediction or predefined rules. Furthermore, a DRL-based online learning architecture is presented, which is important for applying the DRL algorithm to HEV energy management under different driving conditions. Simulation experiments have been conducted using MATLAB and Advanced Vehicle Simulator (ADVISOR) co-simulation. Experimental results validate the effectiveness of the DRL-based EMS compared with a rule-based EMS in terms of fuel economy. The online learning architecture is also shown to be effective. The proposed method ensures optimality as well as real-time applicability in HEVs.

  14. Machine learning methods for planning

    CERN Document Server

    Minton, Steven

    1993-01-01

    Machine Learning Methods for Planning provides information pertinent to learning methods for planning and scheduling. This book covers a wide variety of learning methods and learning architectures, including analogical, case-based, decision-tree, explanation-based, and reinforcement learning.Organized into 15 chapters, this book begins with an overview of planning and scheduling and describes some representative learning systems that have been developed for these tasks. This text then describes a learning apprentice for calendar management. Other chapters consider the problem of temporal credi

  15. Reusable Reinforcement Learning via Shallow Trails.

    Science.gov (United States)

    Yu, Yang; Chen, Shi-Yong; Da, Qing; Zhou, Zhi-Hua

    2018-06-01

    Reinforcement learning has shown great success in helping learning agents accomplish tasks autonomously from environment interactions. Meanwhile, in many real-world applications, an agent needs to accomplish not only a fixed task but also a range of tasks. For this goal, an agent can learn a metapolicy over a set of training tasks that are drawn from an underlying distribution. By maximizing the total reward summed over all the training tasks, the metapolicy can then be reused in accomplishing test tasks from the same distribution. However, in practice, we face two major obstacles to training and reusing metapolicies well. First, how to identify tasks that are unrelated or even opposed to each other, in order to avoid their mutual interference during training. Second, how to characterize task features, according to which a metapolicy can be reused. In this paper, we propose the MetA-Policy LEarning (MAPLE) approach, which overcomes the two difficulties by introducing the shallow trail. It probes a task by running a roughly trained policy. Using the rewards of the shallow trail, MAPLE automatically groups similar tasks. Moreover, when the task parameters are unknown, the rewards of the shallow trail also serve as task features. Empirical studies on several controlling tasks verify that MAPLE can train metapolicies well and receives high reward on test tasks.

  16. Reinforcer magnitude and rate dependency: evaluation of resistance-to-change mechanisms.

    Science.gov (United States)

    Pinkston, Jonathan W; Ginsburg, Brett C; Lamb, Richard J

    2014-10-01

    Under many circumstances, reinforcer magnitude appears to modulate the rate-dependent effects of drugs such that when schedules arrange for relatively larger reinforcer magnitudes rate dependency is attenuated compared with behavior maintained by smaller magnitudes. The current literature on resistance to change suggests that increased reinforcer density strengthens operant behavior, and such strengthening effects appear to extend to the temporal control of behavior. As rate dependency may be understood as a loss of temporal control, the effects of reinforcer magnitude on rate dependency may be due to increased resistance to disruption of temporally controlled behavior. In the present experiments, pigeons earned different magnitudes of grain during signaled components of a multiple FI schedule. Three drugs, clonidine, haloperidol, and morphine, were examined. All three decreased overall rates of key pecking; however, only the effects of clonidine were attenuated as reinforcer magnitude increased. An analysis of within-interval performance found rate-dependent effects for clonidine and morphine; however, these effects were not modulated by reinforcer magnitude. In addition, we included prefeeding and extinction conditions, standard tests used to measure resistance to change. In general, rate-decreasing effects of prefeeding and extinction were attenuated by increasing reinforcer magnitudes. Rate-dependent analyses of prefeeding showed rate-dependency following those tests, but in no case were these effects modulated by reinforcer magnitude. The results suggest that a resistance-to-change interpretation of the effects of reinforcer magnitude on rate dependency is not viable.

  17. Manufacturing Scheduling Using Colored Petri Nets and Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Maria Drakaki

    2017-02-01

    Full Text Available Agent-based intelligent manufacturing control systems are capable of efficiently responding and adapting to environmental changes. Manufacturing system adaptation and evolution can be addressed with learning mechanisms that increase the intelligence of agents. In this paper a manufacturing scheduling method is presented based on Timed Colored Petri Nets (CTPNs) and reinforcement learning (RL). CTPNs model the manufacturing system and implement the scheduling. In the search for an optimal solution, a scheduling agent uses RL, in particular the Q-learning algorithm. A warehouse order-picking scheduling problem is presented as a case study to illustrate the method. The proposed scheduling method is compared to existing methods. Simulation and state space results are used to evaluate performance and identify system properties.

  18. An Improved Reinforcement Learning System Using Affective Factors

    Directory of Open Access Journals (Sweden)

    Takashi Kuremoto

    2013-07-01

    Full Text Available As a powerful and intelligent machine learning method, reinforcement learning (RL) has been widely used in many fields such as game theory, adaptive control, multi-agent systems, nonlinear forecasting, and so on. The main contribution of this technique is its exploration and exploitation approaches to find the optimal solution or a semi-optimal solution of goal-directed problems. However, when RL is applied to multi-agent systems (MASs), problems such as the "curse of dimensionality", the "perceptual aliasing problem", and the uncertainty of the environment constitute high hurdles for RL. Meanwhile, although RL is inspired by behavioral psychology and uses reward/punishment from the environment, higher mental factors such as affects, emotions, and motivations are rarely adopted in the learning procedure of RL. In this paper, to address the challenges of agent learning in MASs, we propose a computational motivation function, which adopts the two principal affective factors "Arousal" and "Pleasure" of Russell's circumplex model of affect, to improve the learning performance of a conventional RL algorithm named Q-learning (QL). Compared with conventional QL, computer simulations of pursuit problems with static and dynamic prey were carried out, and the results showed that the proposed method gives agents a faster and more stable learning performance.

  19. Optimizing Chemical Reactions with Deep Reinforcement Learning.

    Science.gov (United States)

    Zhou, Zhenpeng; Li, Xiaocheng; Zare, Richard N

    2017-12-27

    Deep reinforcement learning was employed to optimize chemical reactions. Our model iteratively records the results of a chemical reaction and chooses new experimental conditions to improve the reaction outcome. This model outperformed a state-of-the-art blackbox optimization algorithm by using 71% fewer steps on both simulations and real reactions. Furthermore, we introduced an efficient exploration strategy by drawing the reaction conditions from certain probability distributions, which resulted in an improvement on regret from 0.062 to 0.039 compared with a deterministic policy. Combining the efficient exploration policy with accelerated microdroplet reactions, optimal reaction conditions were determined in 30 min for the four reactions considered, and a better understanding of the factors that control microdroplet reactions was reached. Moreover, our model showed a better performance after training on reactions with similar or even dissimilar underlying mechanisms, which demonstrates its learning ability.

  20. 'Proactive' use of cue-context congruence for building reinforcement learning's reward function.

    Science.gov (United States)

    Zsuga, Judit; Biro, Klara; Tajti, Gabor; Szilasi, Magdolna Emma; Papp, Csaba; Juhasz, Bela; Gesztelyi, Rudolf

    2016-10-28

    Reinforcement learning is a fundamental form of learning that may be formalized using the Bellman equation. Accordingly, an agent determines the state value as the sum of the immediate reward and the discounted value of future states. Thus the value of a state is determined by agent-related attributes (action set, policy, discount factor) and the agent's knowledge of the environment, embodied by the reward function and hidden environmental factors given by the transition probability. The central objective of reinforcement learning is to solve these two functions outside the agent's control, either using or not using a model. In the present paper, using the proactive model of reinforcement learning, we offer insight into how the brain creates simplified representations of the environment, and how these representations are organized to support the identification of relevant stimuli and actions. Furthermore, we identify neurobiological correlates of our model by suggesting that the reward and policy functions, attributes of the Bellman equation, are built by the orbitofrontal cortex (OFC) and the anterior cingulate cortex (ACC), respectively. Based on this we propose that the OFC assesses cue-context congruence to activate the most relevant context frame. Furthermore, given the bidirectional neuroanatomical link between the OFC and model-free structures, we suggest that model-based input is incorporated into the reward prediction error (RPE) signal, and conversely the RPE signal may be used to update the reward-related information of context frames and the policy underlying action selection in the OFC and ACC, respectively. Furthermore, clinical implications for cognitive behavioral interventions are discussed.
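
    For reference, the Bellman equation referred to above can be written in its standard form (notation assumed here, not taken from the paper: policy \pi, transition probability P, reward r, discount factor \gamma):

        % Value of state s under policy \pi: expected immediate reward plus discounted successor value.
        V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r(s, a, s') + \gamma V^{\pi}(s') \right]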

  1. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.

    Science.gov (United States)

    Zsuga, Judit; Biro, Klara; Papp, Csaba; Tajti, Gabor; Gesztelyi, Rudolf

    2016-02-01

    Reinforcement learning (RL) is a powerful concept underlying forms of associative learning governed by the use of a scalar reward signal, with learning taking place if expectations are violated. RL may be assessed using model-based and model-free approaches. Model-based reinforcement learning involves the amygdala, the hippocampus, and the orbitofrontal cortex (OFC). The model-free system involves the pedunculopontine-tegmental nucleus (PPTgN), the ventral tegmental area (VTA) and the ventral striatum (VS). Based on the functional connectivity of the VS, model-free and model-based RL systems center on the VS, which computes value by integrating model-free signals (received as reward prediction errors) and model-based reward-related input. Using the concept of the reinforcement learning agent, we propose that the VS serves as the value function component of the RL agent. Regarding the model utilized for model-based computations, we turned to the proactive brain concept, which offers a ubiquitous function for the default network based on its great functional overlap with contextual associative areas. Hence, by means of the default network, the brain continuously organizes its environment into context frames, enabling the formulation of analogy-based associations that are turned into predictions of what to expect. The OFC integrates reward-related information into context frames upon computing reward expectation by compiling stimulus-reward and context-reward information offered by the amygdala and hippocampus, respectively. Furthermore, we suggest that the integration of model-based expectations regarding reward into the value signal is further supported by the efferents of the OFC that reach structures canonical for model-free learning (e.g., the PPTgN, VTA, and VS). (c) 2016 APA, all rights reserved.

  2. Tank War Using Online Reinforcement Learning

    DEFF Research Database (Denmark)

    Toftgaard Andersen, Kresten; Zeng, Yifeng; Dahl Christensen, Dennis

    2009-01-01

    Real-Time Strategy (RTS) games provide a challenging platform for implementing online reinforcement learning (RL) techniques in a real application. The computer, as one player, monitors its opponents' (human or other computer) strategies and then updates its own policy using RL methods. In this paper, we propose a multi-layer framework for implementing online RL in an RTS game. The framework significantly reduces the RL computational complexity by decomposing the state space in a hierarchical manner. We implement the RTS game Tank General and perform a thorough test of the proposed framework. The results show the effectiveness of our proposed framework and shed light on relevant issues in using RL in RTS games.

  3. Strength of Temporal White Matter Pathways Predicts Semantic Learning.

    Science.gov (United States)

    Ripollés, Pablo; Biel, Davina; Peñaloza, Claudia; Kaufmann, Jörn; Marco-Pallarés, Josep; Noesselt, Toemme; Rodríguez-Fornells, Antoni

    2017-11-15

    Learning the associations between words and meanings is a fundamental human ability. Although the language network is cortically well defined, the role of the white matter pathways supporting novel word-to-meaning mappings remains unclear. Here, by using contextual and cross-situational word learning, we tested whether learning the meaning of a new word is related to the integrity of the language-related white matter pathways in 40 adults (18 women). The arcuate, uncinate, inferior-fronto-occipital and inferior-longitudinal fasciculi were virtually dissected using manual and automatic deterministic fiber tracking. Critically, the automatic method allowed assessing the white matter microstructure along the tract. Results demonstrate that the microstructural properties of the left inferior-longitudinal fasciculus predict contextual learning, whereas the left uncinate was associated with cross-situational learning. In addition, we identified regions of special importance within these pathways: the posterior middle temporal gyrus, thought to serve as a lexical interface and specifically related to contextual learning; the anterior temporal lobe, known to be an amodal hub for semantic processing and related to cross-situational learning; and the white matter near the hippocampus, a structure fundamental for the initial stages of new-word learning and, remarkably, related to both types of word learning. No significant associations were found for the inferior-fronto-occipital fasciculus or the arcuate. While previous results suggest that learning new phonological word forms is mediated by the arcuate fasciculus, these findings show that the temporal pathways are the crucial neural substrate supporting one of the most striking human abilities: our capacity to identify correct associations between words and meanings under referential indeterminacy. SIGNIFICANCE STATEMENT The language-processing network is cortically (i.e., gray matter) well defined. However, the role of the

  4. A Model to Explain the Emergence of Reward Expectancy neurons using Reinforcement Learning and Neural Network

    OpenAIRE

    Shinya, Ishii; Munetaka, Shidara; Katsunari, Shibata

    2006-01-01

    In an experiment of multi-trial task to obtain a reward, reward expectancy neurons, which responded only in the non-reward trials that are necessary to advance toward the reward, have been observed in the anterior cingulate cortex of monkeys. In this paper, to explain the emergence of the reward expectancy neuron in terms of reinforcement learning theory, a model that consists of a recurrent neural network trained based on reinforcement learning is proposed. The analysis of the hi...

  5. FMRQ-A Multiagent Reinforcement Learning Algorithm for Fully Cooperative Tasks.

    Science.gov (United States)

    Zhang, Zhen; Zhao, Dongbin; Gao, Junwei; Wang, Dongqing; Dai, Yujie

    2017-06-01

    In this paper, we propose a multiagent reinforcement learning algorithm dealing with fully cooperative tasks. The algorithm is called frequency of the maximum reward Q-learning (FMRQ). FMRQ aims to achieve one of the optimal Nash equilibria so as to optimize the performance index in multiagent systems. The frequency of obtaining the highest global immediate reward, instead of the immediate reward itself, is used as the reinforcement signal. With FMRQ, each agent does not need to observe the other agents' actions and only shares its state and reward at each step. We validate FMRQ through case studies of repeated games: four two-player two-action cases and one three-player two-action case. It is demonstrated that FMRQ can converge to one of the optimal Nash equilibria in these cases. Moreover, comparison experiments on tasks with multiple states and finite steps are conducted: one is box-pushing and the other is a distributed sensor network problem. Experimental results show that the proposed algorithm outperforms others with higher performance.
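
    The following is one plausible reading, sketched minimally, of the reinforcement signal described above: each agent reinforces a state-action pair by the frequency with which that pair has coincided with the highest global immediate reward observed so far, rather than by the raw reward. It is a hypothetical illustration, not the published FMRQ algorithm.

        # Hypothetical sketch of a frequency-of-maximum-reward reinforcement signal.
        from collections import defaultdict

        class FrequencySignal:
            def __init__(self):
                self.plays = defaultdict(int)
                self.best_hits = defaultdict(int)
                self.best_reward = float("-inf")

            def update(self, state, action, global_reward):
                self.best_reward = max(self.best_reward, global_reward)
                self.plays[(state, action)] += 1
                if global_reward == self.best_reward:
                    self.best_hits[(state, action)] += 1

            def signal(self, state, action):
                n = self.plays[(state, action)]
                return self.best_hits[(state, action)] / n if n else 0.0

        # Usage: the signal for ('s0', 'push') is 0.5 after one best and one sub-best outcome.
        sig = FrequencySignal()
        sig.update("s0", "push", global_reward=5.0)
        sig.update("s0", "push", global_reward=2.0)
        print(sig.signal("s0", "push"))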

  6. Knowledge-Based Reinforcement Learning for Data Mining

    Science.gov (United States)

    Kudenko, Daniel; Grzes, Marek

    Data Mining is the process of extracting patterns from data. Two general avenues of research in the intersecting areas of agents and data mining can be distinguished. The first approach is concerned with mining an agent’s observation data in order to extract patterns, categorize environment states, and/or make predictions of future states. In this setting, data is normally available as a batch, and the agent’s actions and goals are often independent of the data mining task. The data collection is mainly considered as a side effect of the agent’s activities. Machine learning techniques applied in such situations fall into the class of supervised learning. In contrast, the second scenario occurs where an agent is actively performing the data mining, and is responsible for the data collection itself. For example, a mobile network agent is acquiring and processing data (where the acquisition may incur a certain cost), or a mobile sensor agent is moving in a (perhaps hostile) environment, collecting and processing sensor readings. In these settings, the tasks of the agent and the data mining are highly intertwined and interdependent (or even identical). Supervised learning is not a suitable technique for these cases. Reinforcement Learning (RL) enables an agent to learn from experience (in form of reward and punishment for explorative actions) and adapt to new situations, without a teacher. RL is an ideal learning technique for these data mining scenarios, because it fits the agent paradigm of continuous sensing and acting, and the RL agent is able to learn to make decisions on the sampling of the environment which provides the data. Nevertheless, RL still suffers from scalability problems, which have prevented its successful use in many complex real-world domains. The more complex the tasks, the longer it takes a reinforcement learning algorithm to converge to a good solution. For many real-world tasks, human expert knowledge is available. For example, human
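
    The following is a minimal sketch of the kind of tabular reinforcement learning described above: the agent explores with probability epsilon, otherwise exploits its current Q-table, and updates Q from the reward it experiences. The tiny corridor environment is an illustrative assumption.

        # Hypothetical sketch of tabular Q-learning with epsilon-greedy exploration.
        import random
        from collections import defaultdict

        def q_learning(env_step, actions, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
            Q = defaultdict(float)                              # (state, action) -> value
            for _ in range(episodes):
                s, done = 0, False
                while not done:
                    if random.random() < eps:                   # explore
                        a = random.choice(actions)
                    else:                                       # exploit, breaking ties randomly
                        best = max(Q[(s, b)] for b in actions)
                        a = random.choice([b for b in actions if Q[(s, b)] == best])
                    s2, r, done = env_step(s, a)
                    target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
                    Q[(s, a)] += alpha * (target - Q[(s, a)])   # temporal-difference update
                    s = s2
            return Q

        # Usage: a 5-state corridor in which moving right eventually earns a reward of 1.
        def corridor(s, a):
            s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
            return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

        Q = q_learning(corridor, actions=["left", "right"])
        print(max(["left", "right"], key=lambda a: Q[(0, a)]))  # expected: 'right'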

  7. Reinforcement Learning with Autonomous Small Unmanned Aerial Vehicles in Cluttered Environments

    Science.gov (United States)

    Tran, Loc; Cross, Charles; Montague, Gilbert; Motter, Mark; Neilan, James; Qualls, Garry; Rothhaar, Paul; Trujillo, Anna; Allen, B. Danette

    2015-01-01

    We present ongoing work in the Autonomy Incubator at NASA Langley Research Center (LaRC) exploring the efficacy of a data set aggregation approach to reinforcement learning for small unmanned aerial vehicle (sUAV) flight in dense and cluttered environments with reactive obstacle avoidance. The goal is to learn an autonomous flight model using training experiences from a human piloting a sUAV around static obstacles. The training approach uses video data from a forward-facing camera that records the human pilot's flight. Various computer vision based features are extracted from the video relating to edge and gradient information. The recorded human-controlled inputs are used to train an autonomous control model that correlates the extracted feature vector to a yaw command. As part of the reinforcement learning approach, the autonomous control model is iteratively updated with feedback from a human agent who corrects undesired model output. This data driven approach to autonomous obstacle avoidance is explored for simulated forest environments furthering autonomous flight under the tree canopy research. This enables flight in previously inaccessible environments which are of interest to NASA researchers in Earth and Atmospheric sciences.

  8. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    Science.gov (United States)

    Kato, Ayaka; Morita, Kenji

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can sustain if there is decay/forgetting of learned-values, which can be implemented as decay of synaptic strengths storing learned-values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems for value-learning
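
    The following is a minimal sketch of "reinforcement learning with value-decay" as described above: every learned value decays a little on each step, so even a fully predicted reward keeps producing a positive prediction error (a sustained RPE). The decay rate and the single-action task are illustrative assumptions, not the paper's simulation.

        # Hypothetical sketch: value decay keeps the reward-prediction error from vanishing.
        def decayed_value_learning(steps=2000, alpha=0.3, decay=0.01, reward=1.0):
            v = 0.0                                  # learned value of the rewarded state/action
            rpe = 0.0
            for _ in range(steps):
                rpe = reward - v                     # prediction error at the (terminal) reward
                v += alpha * rpe
                v *= (1 - decay)                     # forgetting: the learned value decays
            return rpe

        print(round(decayed_value_learning(), 3))              # stays above zero
        print(round(decayed_value_learning(decay=0.0), 3))     # without decay it vanishes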

  9. Forgetting in Reinforcement Learning Links Sustained Dopamine Signals to Motivation.

    Directory of Open Access Journals (Sweden)

    Ayaka Kato

    2016-10-01

    It has been suggested that dopamine (DA) represents reward-prediction-error (RPE) defined in reinforcement learning and therefore DA responds to unpredicted but not predicted reward. However, recent studies have found DA response sustained towards predictable reward in tasks involving self-paced behavior, and suggested that this response represents a motivational signal. We have previously shown that RPE can be sustained if there is decay/forgetting of learned values, which can be implemented as decay of synaptic strengths storing learned values. This account, however, did not explain the suggested link between tonic/sustained DA and motivation. In the present work, we explored the motivational effects of the value-decay in self-paced approach behavior, modeled as a series of 'Go' or 'No-Go' selections towards a goal. Through simulations, we found that the value-decay can enhance motivation, specifically, facilitate fast goal-reaching, albeit counterintuitively. Mathematical analyses revealed that underlying potential mechanisms are twofold: (1) decay-induced sustained RPE creates a gradient of 'Go' values towards a goal, and (2) value-contrasts between 'Go' and 'No-Go' are generated because while chosen values are continually updated, unchosen values simply decay. Our model provides potential explanations for the key experimental findings that suggest DA's roles in motivation: (i) slowdown of behavior by post-training blockade of DA signaling, (ii) observations that DA blockade severely impairs effortful actions to obtain rewards while largely sparing seeking of easily obtainable rewards, and (iii) relationships between the reward amount, the level of motivation reflected in the speed of behavior, and the average level of DA. These results indicate that reinforcement learning with value-decay, or forgetting, provides a parsimonious mechanistic account for the DA's roles in value-learning and motivation. Our results also suggest that when biological systems

  10. Vicarious reinforcement learning signals when instructing others.

    Science.gov (United States)

    Apps, Matthew A J; Lesage, Elise; Ramnani, Narender

    2015-02-18

    Reinforcement learning (RL) theory posits that learning is driven by discrepancies between the predicted and actual outcomes of actions (prediction errors [PEs]). In social environments, learning is often guided by similar RL mechanisms. For example, teachers monitor the actions of students and provide feedback to them. This feedback evokes PEs in students that guide their learning. We report the first study that investigates the neural mechanisms that underpin RL signals in the brain of a teacher. Neurons in the anterior cingulate cortex (ACC) signal PEs when learning from the outcomes of one's own actions but also signal information when outcomes are received by others. Does a teacher's ACC signal PEs when monitoring a student's learning? Using fMRI, we studied brain activity in human subjects (teachers) as they taught a confederate (student) action-outcome associations by providing positive or negative feedback. We examined activity time-locked to the students' responses, when teachers infer student predictions and know actual outcomes. We fitted an RL-based computational model to the behavior of the student to characterize their learning, and examined whether a teacher's ACC signals when a student's predictions are wrong. In line with our hypothesis, activity in the teacher's ACC covaried with the PE values in the model. Additionally, activity in the teacher's insula and ventromedial prefrontal cortex covaried with the predicted value according to the student. Our findings highlight that the ACC signals PEs vicariously for others' erroneous predictions, when monitoring and instructing their learning. These results suggest that RL mechanisms, processed vicariously, may underpin and facilitate teaching behaviors. Copyright © 2015 Apps et al.

  11. Amygdala and ventral striatum make distinct contributions to reinforcement learning

    Science.gov (United States)

    Costa, Vincent D.; Monte, Olga Dal; Lucas, Daniel R.; Murray, Elisabeth A.; Averbeck, Bruno B.

    2016-01-01

    Reinforcement learning (RL) theories posit that dopaminergic signals are integrated within the striatum to associate choices with outcomes. Often overlooked is that the amygdala also receives dopaminergic input and is involved in Pavlovian processes that influence choice behavior. To determine the relative contributions of the ventral striatum (VS) and amygdala to appetitive RL, we tested rhesus macaques with VS or amygdala lesions on deterministic and stochastic versions of a two-arm bandit reversal learning task. When learning was characterized with an RL model, amygdala lesions caused general decreases in learning from positive feedback and in choice consistency relative to controls. By comparison, VS lesions only affected learning in the stochastic task. Moreover, the VS lesions hastened the monkeys’ choice reaction times, which emphasized a speed-accuracy tradeoff that accounted for errors in deterministic learning. These results update standard accounts of RL by emphasizing distinct contributions of the amygdala and VS to RL. PMID:27720488

  12. Active-learning strategies: the use of a game to reinforce learning in nursing education. A case study.

    Science.gov (United States)

    Boctor, Lisa

    2013-03-01

    The majority of nursing students are kinesthetic learners, preferring a hands-on, active approach to education. Research shows that active-learning strategies can increase student learning and satisfaction. This study looks at the use of one active-learning strategy, a Jeopardy-style game, 'Nursopardy', to reinforce Fundamentals of Nursing material, aiding in students' preparation for a standardized final exam. The game was created keeping students' varied learning styles and the NCLEX blueprint in mind. The blueprint was used to create 5 categories, with 26 total questions. Student survey results, using a five-point Likert scale, showed that students found this learning method enjoyable and beneficial to learning. More research is recommended regarding learning outcomes when using active-learning strategies, such as games. Copyright © 2012 Elsevier Ltd. All rights reserved.

  13. Adolescent-specific patterns of behavior and neural activity during social reinforcement learning

    OpenAIRE

    Jones, Rebecca M.; Somerville, Leah H.; Li, Jian; Ruberry, Erika J.; Powers, Alisa; Mehta, Natasha; Dyke, Jonathan; Casey, BJ

    2014-01-01

    Humans are sophisticated social beings. Social cues from others are exceptionally salient, particularly during adolescence. Understanding how adolescents interpret and learn from variable social signals can provide insight into the observed shift in social sensitivity during this period. The current study tested 120 participants between the ages of 8 and 25 years on a social reinforcement learning task where the probability of receiving positive social feedback was parametrically manipulated....

  14. Spared internal but impaired external reward prediction error signals in major depressive disorder during reinforcement learning.

    Science.gov (United States)

    Bakic, Jasmina; Pourtois, Gilles; Jepma, Marieke; Duprat, Romain; De Raedt, Rudi; Baeken, Chris

    2017-01-01

    Major depressive disorder (MDD) creates debilitating effects on a wide range of cognitive functions, including reinforcement learning (RL). In this study, we sought to assess whether reward processing as such, or alternatively the complex interplay between motivation and reward, might potentially account for the abnormal reward-based learning in MDD. A total of 35 treatment-resistant MDD patients and 44 age-matched healthy controls (HCs) performed a standard probabilistic learning task. RL was titrated using behavioral, computational modeling and event-related brain potentials (ERPs) data. MDD patients showed a learning rate comparable to that of HCs. However, they showed decreased lose-shift responses as well as blunted subjective evaluations of the reinforcers used during the task, relative to HCs. Moreover, MDD patients showed normal internal (at the level of error-related negativity, ERN) but abnormal external (at the level of feedback-related negativity, FRN) reward prediction error (RPE) signals during RL, selectively when additional efforts had to be made to establish learning. Collectively, these results lend support to the assumption that MDD does not impair reward processing per se during RL. Instead, it seems to alter the processing of the emotional value of (external) reinforcers during RL, when additional intrinsic motivational processes have to be engaged. © 2016 Wiley Periodicals, Inc.

  15. Emotion in reinforcement learning agents and robots: A survey

    OpenAIRE

    Moerland, T.M.; Broekens, D.J.; Jonker, C.M.

    2018-01-01

    This article provides the first survey of computational models of emotion in reinforcement learning (RL) agents. The survey focuses on agent/robot emotions, and mostly ignores human user emotions. Emotions are recognized as functional in decision-making by influencing motivation and action selection. Therefore, computational emotion models are usually grounded in the agent's decision making architecture, of which RL is an important subclass. Studying emotions in RL-based agents is useful for ...

  16. Effect of quinolinic acid-induced lesions of the nucleus accumbens core on performance on a progressive ratio schedule of reinforcement: implications for inter-temporal choice.

    Science.gov (United States)

    Bezzina, G; Body, S; Cheung, T H C; Hampson, C L; Deakin, J F W; Anderson, I M; Szabadi, E; Bradshaw, C M

    2008-04-01

    The nucleus accumbens core (AcbC) is believed to contribute to the control of operant behaviour by reinforcers. Recent evidence suggests that it is not crucial for determining the incentive value of immediately available reinforcers, but is important for maintaining the values of delayed reinforcers. This study aims to examine the effect of AcbC lesions on performance on a progressive-ratio schedule using a quantitative model that dissociates effects of interventions on motor and motivational processes (Killeen, 1994, Mathematical principles of reinforcement. Behav Brain Sci 17:105-172). Rats with bilateral quinolinic acid-induced lesions of the AcbC (n = 15) or sham lesions (n = 14) were trained to lever-press for food-pellet reinforcers under a progressive-ratio schedule. In Phase 1 (90 sessions) the reinforcer was one pellet; in Phase 2 (30 sessions), it was two pellets; in Phase 3 (30 sessions), it was one pellet. The performance of both groups conformed to the model of progressive-ratio performance (group mean data: r2 > 0.92). The motor parameter, delta, was significantly higher in the AcbC-lesioned than the sham-lesioned group, reflecting lower overall response rates in the lesioned group. The motivational parameter, a, was sensitive to changes in reinforcer size, but did not differ significantly between the two groups. The AcbC-lesioned group showed longer post-reinforcement pauses and lower running response rates than the sham-lesioned group. The results suggest that destruction of the AcbC impairs response capacity but does not alter the efficacy of food reinforcers. The results are consistent with recent findings that AcbC lesions do not alter sensitivity to reinforcer size in inter-temporal choice schedules.

  17. An analysis of intergroup rivalry using Ising model and reinforcement learning

    Science.gov (United States)

    Zhao, Feng-Fei; Qin, Zheng; Shao, Zhuo

    2014-01-01

    Modeling of intergroup rivalry can help us better understand economic competitions, political elections and other similar activities. The result of intergroup rivalry depends on the co-evolution of individual behavior within one group and the impact from the rival group. In this paper, we model the rivalry behavior using the Ising model. Different from other simulation studies using the Ising model, the evolution rules of each individual in our model are not static, but have the ability to learn from historical experience using a reinforcement learning technique, which makes the simulation closer to real human behavior. We studied the phase transition in intergroup rivalry and focused on the impact of the degree of social freedom, the personality of group members and the social experience of individuals. The results of computer simulation show that a society with a low degree of social freedom and highly educated, experienced individuals is more likely to be one-sided in intergroup rivalry.
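    A toy sketch of the combination described above, with all details assumed: each site of a small Ising-style lattice keeps Q-values for 'keep' and 'flip', samples an action from a softmax policy, and treats alignment with its neighbours as the reward. The inverse temperature beta loosely plays the role of (inverse) social freedom; the paper's actual evolution rules are not reproduced here.

```python
import numpy as np

def run_rivalry(n=20, steps=20000, alpha=0.1, beta=1.0, seed=1):
    """Ising-like lattice where each site learns, via a simple Q-update,
    whether to flip or keep its opinion; the reward is the post-move alignment
    with its four neighbours (a stand-in for social reinforcement)."""
    rng = np.random.default_rng(seed)
    spins = rng.choice([-1, 1], size=(n, n))
    Q = np.zeros((n, n, 2))                 # Q[i, j, a]: 0 = keep, 1 = flip
    for _ in range(steps):
        i, j = rng.integers(n, size=2)
        prefs = beta * Q[i, j]
        probs = np.exp(prefs - prefs.max()); probs /= probs.sum()  # softmax policy
        a = rng.choice(2, p=probs)
        if a == 1:
            spins[i, j] *= -1
        neighbours = spins[(i - 1) % n, j] + spins[(i + 1) % n, j] \
                   + spins[i, (j - 1) % n] + spins[i, (j + 1) % n]
        reward = spins[i, j] * neighbours   # alignment with neighbours
        Q[i, j, a] += alpha * (reward - Q[i, j, a])
    return abs(spins.mean())                # |magnetisation|: one-sidedness

print("final |magnetisation|:", run_rivalry())
```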

  18. SPAN: spike pattern association neuron for learning spatio-temporal sequences

    OpenAIRE

    Mohemmed, A; Schliebs, S; Matsuda, S; Kasabov, N

    2012-01-01

    Spiking Neural Networks (SNN) were shown to be suitable tools for the processing of spatio-temporal information. However, due to their inherent complexity, the formulation of efficient supervised learning algorithms for SNN is difficult and remains an important problem in the research area. This article presents SPAN — a spiking neuron that is able to learn associations of arbitrary spike trains in a supervised fashion allowing the processing of spatio-temporal information encoded in the prec...

  19. Decision Making in Reinforcement Learning Using a Modified Learning Space Based on the Importance of Sensors

    Directory of Open Access Journals (Sweden)

    Yasutaka Kishima

    2013-01-01

    Many studies have been conducted on the application of reinforcement learning (RL) to robots. A robot which is made for general purposes has redundant sensors or actuators, because it is difficult to assume in advance the environment that the robot will face and the task that the robot must execute. In this case, the state space in RL contains redundancy, so the robot must take much time to learn a given task. In this study, we focus on the importance of sensors with regard to a robot's performance of a particular task. The sensors that are applicable to a task differ according to the task. By using the importance of the sensors, we try to adjust the number of states assigned to each sensor and to reduce the size of the state space. In this paper, we define the measure of importance of a sensor for a task as the correlation between the value of each sensor and the reward. A robot calculates the importance of the sensors and makes the size of the state space smaller. We propose a method which reduces the learning space and construct the learning system by incorporating it into RL. In this paper, we confirm the effectiveness of our proposed system with an experimental robot.
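    A minimal sketch of the core idea, under the assumption that importance is the absolute correlation between a sensor's readings and the reward (as the abstract states) and that discretization bins are then allocated in proportion to importance; the function names and bin limits are illustrative.

```python
import numpy as np

def sensor_importance(sensor_log, reward_log):
    """Importance of each sensor = |Pearson correlation| between its readings
    and the reward received."""
    X = np.asarray(sensor_log, dtype=float)        # shape (T, n_sensors)
    r = np.asarray(reward_log, dtype=float)        # shape (T,)
    X_c = X - X.mean(axis=0)
    r_c = r - r.mean()
    denom = X_c.std(axis=0) * r_c.std() + 1e-12
    return np.abs((X_c * r_c[:, None]).mean(axis=0) / denom)

def bins_per_sensor(importance, min_bins=2, max_bins=8):
    """Give important sensors a fine discretization and unimportant ones a
    coarse one, shrinking the size of the tabular state space."""
    scaled = importance / (importance.max() + 1e-12)
    return (min_bins + np.round(scaled * (max_bins - min_bins))).astype(int)

# Toy example: sensor 0 is reward-relevant, sensor 1 is noise.
rng = np.random.default_rng(0)
sensors = rng.random((1000, 2))
rewards = 2.0 * sensors[:, 0] + 0.1 * rng.standard_normal(1000)
imp = sensor_importance(sensors, rewards)
print("importance:", imp.round(2), "bins:", bins_per_sensor(imp))
print("state-space size:", int(np.prod(bins_per_sensor(imp))))
```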

  20. A Computational Model of the Temporal Dynamics of Plasticity in Procedural Learning: Sensitivity to Feedback Timing

    Directory of Open Access Journals (Sweden)

    Vivian V. Valentin

    2014-07-01

    The evidence is now good that different memory systems mediate the learning of different types of category structures. In particular, declarative memory dominates rule-based (RB) category learning and procedural memory dominates information-integration (II) category learning. For example, several studies have reported that feedback timing is critical for II category learning, but not for RB category learning – results that have broad support within the memory systems literature. Specifically, II category learning has been shown to be best with feedback delays of 500 ms compared to delays of 0 and 1000 ms, and highly impaired with delays of 2.5 seconds or longer. In contrast, RB learning is unaffected by any feedback delay up to 10 seconds. We propose a neurobiologically detailed theory of procedural learning that is sensitive to different feedback delays. The theory assumes that procedural learning is mediated by plasticity at cortical-striatal synapses that are modified by dopamine-mediated reinforcement learning. The model captures the time-course of the biochemical events in the striatum that cause synaptic plasticity, and thereby accounts for the empirical effects of various feedback delays on II category learning.

  1. Learning and memory and its relationship with the lateralization of epileptic focus in subjects with temporal lobe epilepsy

    Directory of Open Access Journals (Sweden)

    Daniel Fuentes

    2014-04-01

    Background: In medial temporal lobe epilepsy (MTLE), previous studies addressing the hemispheric laterality of epileptogenic focus and its relationship with learning and memory processes have reported controversial findings. Objective: To compare the performance of MTLE patients according to the location of the epileptogenic focus on the left (MTLEL) or right temporal lobe (MTLER) on tasks of episodic learning and memory for verbal and visual content. Methods: One hundred patients with MTLEL and one hundred patients with MTLER were tested with the following tasks: the Rey Auditory Verbal Learning Test (RAVLT) and the Logical Memory-WMS-R to evaluate verbal learning and memory; and the Rey Visual Design Learning Test (RVDLT) and the Visual Reproduction-WMS-R to evaluate visual learning and memory. Results: The MTLEL sample showed significantly worse performance on the RAVLT (p < 0.005) and on the Logical Memory tests (p < 0.01) than MTLER subjects. However, there were no significant between-group differences in regard to the visual memory tests. Discussion: Our findings suggest that verbal learning and memory abilities are dependent on the structural and functional integrity of the left temporal lobe, while visual abilities are less dependent on the right temporal lobe.

  2. Reinforcement Learning in Distributed Domains: Beyond Team Games

    Science.gov (United States)

    Wolpert, David H.; Sill, Joseph; Turner, Kagan

    2000-01-01

    Distributed search algorithms are crucial in dealing with large optimization problems, particularly when a centralized approach is not only impractical but infeasible. Many machine learning concepts have been applied to search algorithms in order to improve their effectiveness. In this article we present an algorithm that blends Reinforcement Learning (RL) and hill climbing directly, by using the RL signal to guide the exploration step of a hill climbing algorithm. We apply this algorithm to the domain of a constellation of communication satellites where the goal is to minimize the loss of importance-weighted data. We introduce the concept of 'ghost' traffic, where correctly setting this traffic induces the satellites to act to optimize the world utility. Our results indicate that the bi-utility search introduced in this paper outperforms both traditional hill climbing algorithms and distributed RL approaches such as team games.
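    A loose, generic sketch of the idea of using a learned value signal to bias the exploration step of hill climbing; the paper's 'ghost' traffic mechanism and satellite domain are not reproduced, and the move set, utility function, and parameters below are purely illustrative.

```python
import numpy as np

def rl_guided_hill_climb(utility, x0, n_moves, steps=200, alpha=0.2,
                         eps=0.2, seed=0):
    """Hill climbing in which the exploration step is biased by a learned
    value for each move direction: moves that have improved utility in the
    past are tried preferentially, with occasional random exploration."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    Q = np.zeros(n_moves)                       # learned value of each move
    moves = rng.standard_normal((n_moves, x.size)) * 0.1
    best_u = utility(x)
    for _ in range(steps):
        a = rng.integers(n_moves) if rng.random() < eps else int(Q.argmax())
        candidate = x + moves[a]
        gain = utility(candidate) - best_u
        Q[a] += alpha * (gain - Q[a])           # RL-style signal: observed gain
        if gain > 0:                            # hill-climbing acceptance rule
            x, best_u = candidate, best_u + gain
    return x, best_u

# Toy utility: negative distance to a target point.
target = np.array([1.0, -2.0])
x_opt, u_opt = rl_guided_hill_climb(lambda x: -np.linalg.norm(x - target),
                                    x0=[0.0, 0.0], n_moves=8)
print("found:", x_opt.round(2), "utility:", round(u_opt, 3))
```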

  3. Cerebellar and prefrontal cortex contributions to adaptation, strategies, and reinforcement learning.

    Science.gov (United States)

    Taylor, Jordan A; Ivry, Richard B

    2014-01-01

    Traditionally, motor learning has been studied as an implicit learning process, one in which movement errors are used to improve performance in a continuous, gradual manner. The cerebellum figures prominently in this literature given well-established ideas about the role of this system in error-based learning and the production of automatized skills. Recent developments have brought into focus the relevance of multiple learning mechanisms for sensorimotor learning. These include processes involving repetition, reinforcement learning, and strategy utilization. We examine these developments, considering their implications for understanding cerebellar function and how this structure interacts with other neural systems to support motor learning. Converging lines of evidence from behavioral, computational, and neuropsychological studies suggest a fundamental distinction between processes that use error information to improve action execution or action selection. While the cerebellum is clearly linked to the former, its role in the latter remains an open question. © 2014 Elsevier B.V. All rights reserved.

  4. Learning of spatio-temporal codes in a coupled oscillator system.

    Science.gov (United States)

    Orosz, Gábor; Ashwin, Peter; Townley, Stuart

    2009-07-01

    In this paper, we consider a learning strategy that allows one to transmit information between two coupled phase oscillator systems (called teaching and learning systems) via frequency adaptation. The dynamics of these systems can be modeled with reference to a number of partially synchronized cluster states and transitions between them. Forcing the teaching system by steady but spatially nonhomogeneous inputs produces cyclic sequences of transitions between the cluster states, that is, information about inputs is encoded via a "winnerless competition" process into spatio-temporal codes. The large variety of codes can be learned by the learning system that adapts its frequencies to those of the teaching system. We visualize the dynamics using "weighted order parameters (WOPs)" that are analogous to "local field potentials" in neural systems. Since spatio-temporal coding is a mechanism that appears in olfactory systems, the developed learning rules may help to extract information from these neural ensembles.

  5. Reinforcement learning of self-regulated β-oscillations for motor restoration in chronic stroke

    Directory of Open Access Journals (Sweden)

    Georgios Naros

    2015-07-01

    Neurofeedback training of motor imagery-related brain-states with brain-machine interfaces (BMI) is currently being explored prior to standard physiotherapy to improve the motor outcome of stroke rehabilitation. Pilot studies suggest that such a priming intervention before physiotherapy might increase the responsiveness of the brain to the subsequent physiotherapy, thereby improving the clinical outcome. However, there is little evidence up to now that these BMI-based interventions have achieved operant conditioning of specific brain states that facilitate task-specific functional gains beyond the practice of primed physiotherapy. In this context, we argue that BMI technology needs to aim at physiological features relevant for the targeted behavioral gain. Moreover, this therapeutic intervention has to be informed by concepts of reinforcement learning to develop its full potential. Such a refined neurofeedback approach would need to address the following issues: (1) defining a physiological feedback target specific to the intended behavioral gain, e.g. β-band oscillations for cortico-muscular communication; this targeted brain state could well be different from the brain state optimal for the neurofeedback task; (2) selecting a BMI classification and thresholding approach on the basis of learning principles, i.e. balancing challenge and reward of the neurofeedback task instead of maximizing the classification accuracy of the feedback device; (3) adjusting the feedback in the course of the training period to account for the cognitive load and the learning experience of the participant. The proposed neurofeedback strategy provides evidence for the feasibility of the suggested approach by demonstrating that dynamic threshold adaptation based on reinforcement learning may lead to frequency-specific operant conditioning of β-band oscillations paralleled by task-specific motor improvement; a proposal that requires investigation in a larger cohort of stroke

  6. Supervised Learning in Spiking Neural Networks for Precise Temporal Encoding.

    Science.gov (United States)

    Gardner, Brian; Grüning, André

    2016-01-01

    Precise spike timing as a means to encode information in neural networks is biologically supported, and is advantageous over frequency-based codes by processing input features on a much shorter time-scale. For these reasons, much recent attention has been focused on the development of supervised learning rules for spiking neural networks that utilise a temporal coding scheme. However, despite significant progress in this area, there is still a lack of rules that have a theoretical basis and yet can be considered biologically relevant. Here we examine the general conditions under which synaptic plasticity most effectively takes place to support the supervised learning of a precise temporal code. As part of our analysis we examine two spike-based learning methods: one of which relies on an instantaneous error signal to modify synaptic weights in a network (INST rule), and the other one relying on a filtered error signal for smoother synaptic weight modifications (FILT rule). We test the accuracy of the solutions provided by each rule with respect to their temporal encoding precision, and then measure the maximum number of input patterns they can learn to memorise using the precise timings of individual spikes as an indication of their storage capacity. Our results demonstrate the high performance of the FILT rule in most cases, underpinned by the rule's error-filtering mechanism, which is predicted to provide smooth convergence towards a desired solution during learning. We also find the FILT rule to be most efficient at performing input pattern memorisations, and most noticeably when patterns are identified using spikes with sub-millisecond temporal precision. In comparison with existing work, we determine the performance of the FILT rule to be consistent with that of the highly efficient E-learning Chronotron rule, but with the distinct advantage that our FILT rule is also implementable as an online method for increased biological realism.

  7. Rats do not respond differently in the presence of stimuli signaling wheel-running reinforcers of different durations.

    Science.gov (United States)

    Belke, Terry W

    2007-05-01

    Rats were exposed to a fixed-interval 30 s schedule that produced opportunities to run of equal or unequal durations, to assess the effect of differences in duration on responding. Each duration was signaled by a different stimulus. Wheel-running reinforcer duration pairs were 30 s/30 s, 50 s/10 s, and 55 s/5 s. An analysis of median postreinforcement pause duration and mean local lever-pressing rates broken down by previous reinforcer duration and duration of signaled upcoming reinforcer showed that postreinforcement pause duration was affected by the duration of the previous reinforcer but not by the stimulus signaling the duration of the upcoming reinforcer. Local lever-pressing rates were not affected by either previous or upcoming reinforcer duration. In general, the results are consistent with indifference between these durations obtained using a concurrent choice procedure.

  8. Multiobjective Reinforcement Learning for Traffic Signal Control Using Vehicular Ad Hoc Network

    Directory of Open Access Journals (Sweden)

    Houli Duan

    2010-01-01

    We propose a new multiobjective control algorithm based on reinforcement learning for urban traffic signal control, named multi-RL. A multiagent structure is used to describe the traffic system. A vehicular ad hoc network is used for the data exchange among agents. A reinforcement learning algorithm is applied to predict the overall value of the optimization objective given vehicles' states. The policy which minimizes the cumulative value of the optimization objective is regarded as the optimal one. In order to make the method adaptive to various traffic conditions, we also introduce a multiobjective control scheme in which the optimization objective is selected adaptively according to real-time traffic states. The optimization objectives include the vehicle stops, the average waiting time, and the maximum queue length of the next intersection. In addition, we also provide priority control for buses and emergency vehicles through our model. The simulation results indicate that our algorithm can perform more efficiently than traditional traffic light control methods.
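    A minimal sketch of the adaptive multiobjective scheme described above, assuming a separate tabular Q-function per objective and a hand-written rule for selecting the active objective from the traffic state; the thresholds, state encoding, and costs are illustrative, not taken from the paper.

```python
import numpy as np

OBJECTIVES = ("stops", "waiting_time", "max_queue")

def select_objective(state):
    """Toy rule for the adaptive scheme: under heavy congestion care about
    spill-back (queue length), otherwise about delay or stops."""
    queue, waiting = state
    if queue > 15:
        return "max_queue"
    return "waiting_time" if waiting > 30 else "stops"

def choose_phase(q_tables, state_id, state, eps=0.1, rng=None):
    """Pick a signal phase greedily under the currently selected objective."""
    rng = rng or np.random.default_rng()
    obj = select_objective(state)
    q = q_tables[obj][state_id]
    a = rng.integers(len(q)) if rng.random() < eps else int(q.argmin())
    return a, obj                         # argmin: objectives are costs

def update(q_tables, obj, s, a, cost, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning update on the objective that was being optimised."""
    q = q_tables[obj]
    q[s, a] += alpha * (cost + gamma * q[s_next].min() - q[s, a])

# Toy setup: 50 aggregated traffic states, 4 signal phases.
rng = np.random.default_rng(0)
q_tables = {obj: np.zeros((50, 4)) for obj in OBJECTIVES}
phase, obj = choose_phase(q_tables, state_id=3, state=(18, 25), rng=rng)
update(q_tables, obj, s=3, a=phase, cost=12.0, s_next=7)
print("selected objective:", obj, "phase:", phase)
```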

  9. The Study of Reinforcement Learning for Traffic Self-Adaptive Control under Multiagent Markov Game Environment

    Directory of Open Access Journals (Sweden)

    Lun-Hui Xu

    2013-01-01

    The urban traffic self-adaptive control problem is dynamic and uncertain, so the states of the traffic environment are hard to observe. An efficient agent that controls a single intersection can be discovered automatically via multiagent reinforcement learning. However, in the majority of the previous works on this approach, each agent needed perfectly observed information when interacting with the environment and learned individually, with less efficient coordination. This study casts traffic self-adaptive control as a multiagent Markov game problem. The design employs a traffic signal control agent (TSCA) for each signalized intersection that coordinates with neighboring TSCAs. A mathematical model for TSCAs’ interaction is built based on a nonzero-sum Markov game, which has been applied to let TSCAs learn how to cooperate. A multiagent Markov game reinforcement learning approach is constructed on the basis of single-agent Q-learning. This method lets each TSCA learn to update its Q-values under the joint actions and imperfect information. The convergence of the proposed algorithm is analyzed theoretically. The simulation results show that the proposed method is convergent and effective in a realistic traffic self-adaptive control setting.

  10. TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow

    OpenAIRE

    Hafner, Danijar; Davidson, James; Vanhoucke, Vincent

    2017-01-01

    We introduce TensorFlow Agents, an efficient infrastructure paradigm for building parallel reinforcement learning algorithms in TensorFlow. We simulate multiple environments in parallel, and group them to perform the neural network computation on a batch rather than individual observations. This allows the TensorFlow execution engine to parallelize computation, without the need for manual synchronization. Environments are stepped in separate Python processes to progress them in parallel witho...
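    A single-process sketch of the batching idea (not the TensorFlow Agents API): several toy environments are stepped together so that the policy is evaluated once per batch of observations rather than once per environment; the real library steps environments in separate Python processes, which is omitted here for brevity.

```python
import numpy as np

class ToyEnv:
    """Stand-in environment with a 4-dimensional observation and 2 actions."""
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
        self.state = self.rng.standard_normal(4)
    def step(self, action):
        self.state = self.rng.standard_normal(4)
        reward = float(action == (self.state[0] > 0))
        return self.state, reward

def batched_policy(observations, weights, rng):
    """One forward pass over the whole batch of observations at once."""
    logits = observations @ weights                      # (batch, n_actions)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(2, p=p) for p in probs])

rng = np.random.default_rng(0)
envs = [ToyEnv(seed=i) for i in range(8)]                # 8 parallel environments
weights = rng.standard_normal((4, 2)) * 0.1
obs = np.stack([env.state for env in envs])
for _ in range(100):                                     # batched rollout loop
    actions = batched_policy(obs, weights, rng)          # single batched call
    results = [env.step(a) for env, a in zip(envs, actions)]
    obs = np.stack([s for s, _ in results])
    rewards = np.array([r for _, r in results])
print("mean reward over last batch:", rewards.mean())
```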

  11. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia

    Science.gov (United States)

    Markou, Athina; Salamone, John D.; Bussey, Timothy; Mar, Adam; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-01-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. PMID:23994273

  12. Measuring reinforcement learning and motivation constructs in experimental animals: relevance to the negative symptoms of schizophrenia.

    Science.gov (United States)

    Markou, Athina; Salamone, John D; Bussey, Timothy J; Mar, Adam C; Brunner, Daniela; Gilmour, Gary; Balsam, Peter

    2013-11-01

    The present review article summarizes and expands upon the discussions that were initiated during a meeting of the Cognitive Neuroscience Treatment Research to Improve Cognition in Schizophrenia (CNTRICS; http://cntrics.ucdavis.edu). A major goal of the CNTRICS meeting was to identify experimental procedures and measures that can be used in laboratory animals to assess psychological constructs that are related to the psychopathology of schizophrenia. The issues discussed in this review reflect the deliberations of the Motivation Working Group of the CNTRICS meeting, which included most of the authors of this article as well as additional participants. After receiving task nominations from the general research community, this working group was asked to identify experimental procedures in laboratory animals that can assess aspects of reinforcement learning and motivation that may be relevant for research on the negative symptoms of schizophrenia, as well as other disorders characterized by deficits in reinforcement learning and motivation. The tasks described here that assess reinforcement learning are the Autoshaping Task, Probabilistic Reward Learning Tasks, and the Response Bias Probabilistic Reward Task. The tasks described here that assess motivation are Outcome Devaluation and Contingency Degradation Tasks and Effort-Based Tasks. In addition to describing such methods and procedures, the present article provides a working vocabulary for research and theory in this field, as well as an industry perspective about how such tasks may be used in drug discovery. It is hoped that this review can aid investigators who are conducting research in this complex area, promote translational studies by highlighting shared research goals and fostering a common vocabulary across basic and clinical fields, and facilitate the development of medications for the treatment of symptoms mediated by reinforcement learning and motivational deficits. Copyright © 2013 Elsevier

  13. Sensitivity to Temporal Reward Structure in Amygdala Neurons

    OpenAIRE

    Bermudez, Maria A.; Göbel, Carl; Schultz, Wolfram

    2012-01-01

    The time of reward and the temporal structure of reward occurrence fundamentally influence behavioral reinforcement and decision processes [1–11]. However, despite knowledge about timing in sensory and motor systems [12–17], we know little about temporal mechanisms of neuronal reward processing. In this experiment, visual stimuli predicted different instantaneous probabilities of reward occurrence that resulted in specific temporal reward structures. Licking behavior demonstrated that...

  14. Fast Conflict Resolution Based on Reinforcement Learning in Multi-agent System

    Institute of Scientific and Technical Information of China (English)

    PIAO Songhao; HONG Bingrong; CHU Haitao

    2004-01-01

    In a multi-agent system where each agent has a different goal (even when the team of agents shares the same goal), agents must be able to resolve conflicts arising in the process of achieving their goals. Many researchers have presented methods for conflict resolution, e.g., reinforcement learning (RL), but conventional RL requires a large computational cost because every agent must learn; at the same time, the overlap of actions selected by each agent results in local conflict. Therefore, in this paper we propose a novel method to solve these problems. In order to deal with conflict within the multi-agent system, the concept of a potential-field-function-based action selection priority level (ASPL) is introduced. In this method, all kinds of environmental factors that may influence the priority are effectively computed with the potential field function, so the priority to access the local resource can be decided rapidly. By avoiding the complex coordination mechanisms used in general multi-agent systems, conflict in the multi-agent system is settled more efficiently. Our system consists of an RL with ASPL module and a generalized rules module. Using ASPL, the RL module chooses a proper cooperative behavior, and the generalized rules module can accelerate the learning process. By applying the proposed method to Robot Soccer, the learning process can be accelerated. The results of simulation and real experiments indicate the effectiveness of the method.

  15. Switching Reinforcement Learning for Continuous Action Space

    Science.gov (United States)

    Nagayoshi, Masato; Murao, Hajime; Tamaki, Hisashi

    Reinforcement Learning (RL) attracts much attention as a technique for realizing computational intelligence such as adaptive and autonomous decentralized systems. In general, however, it is not easy to put RL into practical use. This difficulty includes the problem of designing a suitable action space for an agent, i.e., satisfying two requirements in trade-off: (i) to keep the characteristics (or structure) of an original search space as much as possible in order to seek strategies that lie close to the optimal, and (ii) to reduce the search space as much as possible in order to expedite the learning process. In order to design a suitable action space adaptively, we propose a switching RL model that mimics the process of an infant's motor development, in which gross motor skills develop before fine motor skills. A method for switching controllers is then constructed by introducing and referring to the “entropy”. Further, through computational experiments using robot navigation problems with one- and two-dimensional continuous action spaces, the validity of the proposed method has been confirmed.
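    A minimal sketch of one way to realize the switching idea, assuming the switch is driven by the entropy of a softmax policy over the coarse controller's action values: while that entropy is high the coarse action set is used, and once it drops below a threshold the fine action set takes over. The threshold, temperature, and action-set sizes are illustrative.

```python
import numpy as np

def policy_entropy(q_values, beta=2.0):
    """Entropy of a softmax (Boltzmann) policy over the given Q-values."""
    p = np.exp(beta * (q_values - q_values.max()))
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_controller(q_coarse, q_fine, state, threshold=0.5):
    """Use the coarse action set while its policy is still uncertain (high
    entropy); switch to the fine action set once it has become decisive."""
    if policy_entropy(q_coarse[state]) > threshold:
        return "coarse", q_coarse
    return "fine", q_fine

# Toy illustration: 10 states, 3 coarse actions vs. 9 fine actions.
rng = np.random.default_rng(0)
q_coarse = rng.random((10, 3)) * 0.01       # early learning: nearly flat values
q_fine = np.zeros((10, 9))
print(select_controller(q_coarse, q_fine, state=4)[0])   # -> 'coarse'
q_coarse[4] = [0.0, 5.0, 0.0]               # after learning: decisive preference
print(select_controller(q_coarse, q_fine, state=4)[0])   # -> 'fine'
```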

  16. Neural Control of a Tracking Task via Attention-Gated Reinforcement Learning for Brain-Machine Interfaces.

    Science.gov (United States)

    Wang, Yiwen; Wang, Fang; Xu, Kai; Zhang, Qiaosheng; Zhang, Shaomin; Zheng, Xiaoxiang

    2015-05-01

    Reinforcement learning (RL)-based brain machine interfaces (BMIs) enable the user to learn from the environment through interactions to complete the task without desired signals, which is promising for clinical applications. Previous studies exploited Q-learning techniques to discriminate neural states into simple directional actions, with the trial's initial timing provided. However, the movements in BMI applications can be quite complicated, and the action timing explicitly shows the intention of when to move. The rich actions and the corresponding neural states form a large state-action space, imposing generalization difficulty on Q-learning. In this paper, we propose to adopt attention-gated reinforcement learning (AGREL) as a new learning scheme for BMIs to adaptively decode high-dimensional neural activities into seven distinct movements (directional moves, holding, and resting), owing to its efficient weight updating. We apply AGREL on neural data recorded from M1 of a monkey to directly predict a seven-action set in a time sequence to reconstruct the trajectory of a center-out task. Compared to Q-learning techniques, AGREL could improve the target acquisition rate to 90.16% on average with faster convergence and more stability to follow neural activity over multiple days, indicating the potential to achieve better online decoding performance for more complicated BMI tasks.

  17. Binocular rivalry produced by temporal frequency differences

    Directory of Open Access Journals (Sweden)

    David Alais

    2012-07-01

    Binocular rivalry occurs when each eye views images that are markedly different. Rather than seeing a binocular fusion of the two, each image is seen exclusively in a stochastic alternation of the monocular images. Here we examine whether temporal frequency differences will trigger binocular rivalry by presenting two random dot arrays that are spatially matched but which modulate temporally at two different rates and contain no net translation. We found that a perceptual alternation between the two temporal frequencies did indeed occur, provided the frequencies were sufficiently different, indicating that temporal information can produce binocular rivalry in the absence of spatial conflict. This finding is discussed with regard to the dependence of rivalry on conflict between spatial and temporal channels.

  18. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing

    Science.gov (United States)

    Lefebvre, Germain; Blakemore, Sarah-Jayne

    2017-01-01

    Previous studies suggest that factual learning, that is, learning from obtained outcomes, is biased, such that participants preferentially take into account positive, as compared to negative, prediction errors. However, whether or not the prediction error valence also affects counterfactual learning, that is, learning from forgone outcomes, is unknown. To address this question, we analysed the performance of two groups of participants on reinforcement learning tasks using a computational model that was adapted to test if prediction error valence influences learning. We carried out two experiments: in the factual learning experiment, participants learned from partial feedback (i.e., the outcome of the chosen option only); in the counterfactual learning experiment, participants learned from complete feedback information (i.e., the outcomes of both the chosen and unchosen option were displayed). In the factual learning experiment, we replicated previous findings of a valence-induced bias, whereby participants learned preferentially from positive, relative to negative, prediction errors. In contrast, for counterfactual learning, we found the opposite valence-induced bias: negative prediction errors were preferentially taken into account, relative to positive ones. When considering valence-induced bias in the context of both factual and counterfactual learning, it appears that people tend to preferentially take into account information that confirms their current choice. PMID:28800597

  19. Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing.

    Science.gov (United States)

    Palminteri, Stefano; Lefebvre, Germain; Kilford, Emma J; Blakemore, Sarah-Jayne

    2017-08-01

    Previous studies suggest that factual learning, that is, learning from obtained outcomes, is biased, such that participants preferentially take into account positive, as compared to negative, prediction errors. However, whether or not the prediction error valence also affects counterfactual learning, that is, learning from forgone outcomes, is unknown. To address this question, we analysed the performance of two groups of participants on reinforcement learning tasks using a computational model that was adapted to test if prediction error valence influences learning. We carried out two experiments: in the factual learning experiment, participants learned from partial feedback (i.e., the outcome of the chosen option only); in the counterfactual learning experiment, participants learned from complete feedback information (i.e., the outcomes of both the chosen and unchosen option were displayed). In the factual learning experiment, we replicated previous findings of a valence-induced bias, whereby participants learned preferentially from positive, relative to negative, prediction errors. In contrast, for counterfactual learning, we found the opposite valence-induced bias: negative prediction errors were preferentially taken into account, relative to positive ones. When considering valence-induced bias in the context of both factual and counterfactual learning, it appears that people tend to preferentially take into account information that confirms their current choice.
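    A minimal sketch of the kind of asymmetric-learning-rate model the abstract describes, assuming a Rescorla-Wagner-style update with a higher learning rate for choice-confirming prediction errors (positive for the chosen option, negative for the unchosen option); the learning rates and toy task are illustrative, not the fitted values from the study.

```python
import numpy as np

def update_values(values, choice, outcomes, alpha_conf=0.3, alpha_disconf=0.1):
    """Asymmetric update: prediction errors that confirm the current choice
    (positive PE for the chosen option, negative PE for the unchosen option)
    are weighted more than disconfirming ones."""
    for option, outcome in enumerate(outcomes):
        pe = outcome - values[option]
        confirming = (pe > 0) if option == choice else (pe < 0)
        lr = alpha_conf if confirming else alpha_disconf
        values[option] += lr * pe
    return values

# Toy complete-feedback task: option 1 pays off more often than option 0,
# and the outcomes of both options are shown (counterfactual feedback).
rng = np.random.default_rng(0)
values = np.zeros(2)
for _ in range(200):
    choice = int(values.argmax()) if rng.random() > 0.1 else rng.integers(2)
    outcomes = (rng.random(2) < [0.3, 0.7]).astype(float)
    values = update_values(values, choice, outcomes)
print("learned values:", values.round(2))
```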

  20. Seismic response of reinforced concrete frames at different damage levels

    Science.gov (United States)

    Morales-González, Merangeli; Vidot-Vega, Aidcer L.

    2017-03-01

    Performance-based seismic engineering is focused on the definition of limit states to represent different levels of damage, which can be described by material strains, drifts, displacements or even changes in dissipating properties and stiffness of the structure. This study presents a research plan to evaluate the behavior of reinforced concrete (RC) moment resistant frames at different performance levels established by the ASCE 41-06 seismic rehabilitation code. Sixteen RC plane moment frames with different span-to-depth ratios and three 3D RC frames were analyzed to evaluate their seismic behavior at different damage levels established by the ASCE 41-06. For each span-to-depth ratio, four different beam longitudinal reinforcement steel ratios were used that varied from 0.85 to 2.5% for the 2D frames. Nonlinear time history analyses of the frames were performed using scaled ground motions. The impact of different span-to-depth and reinforcement ratios on the damage levels was evaluated. Material strains, rotations and seismic hysteretic energy changes at different damage levels were studied.

  1. Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory.

    Science.gov (United States)

    Collins, Anne G E; Frank, Michael J

    2018-03-06

    Learning from rewards and punishments is essential to survival and facilitates flexible human behavior. It is widely appreciated that multiple cognitive and reinforcement learning systems contribute to decision-making, but the nature of their interactions is elusive. Here, we leverage methods for extracting trial-by-trial indices of reinforcement learning (RL) and working memory (WM) in human electro-encephalography to reveal single-trial computations beyond that afforded by behavior alone. Neural dynamics confirmed that increases in neural expectation were predictive of reduced neural surprise in the following feedback period, supporting central tenets of RL models. Within- and cross-trial dynamics revealed a cooperative interplay between systems for learning, in which WM contributes expectations to guide RL, despite competition between systems during choice. Together, these results provide a deeper understanding of how multiple neural systems interact for learning and decision-making and facilitate analysis of their disruption in clinical populations.

  2. A Reinforcement Learning Approach to Call Admission Control in HAPS Communication System

    Directory of Open Access Journals (Sweden)

    Ni Shu Yan

    2017-01-01

    Large changes in link capacity and in the number of users, caused by the movement of both the platform and the users in a communication system based on a high altitude platform station (HAPS), result in a high handover dropping rate and reduced resource utilization. In order to solve these problems, this paper proposes an adaptive call admission control strategy based on a reinforcement learning approach. The goal of this strategy is to maximize the long-term gain of the system, with the introduction of cross-layer interaction and service downgrading. In order to admit different traffic types adaptively, the access utility of handover traffic and new-call traffic is designed for different states of the communication system. Numerical simulation results show that the proposed call admission control strategy can enhance bandwidth resource utilization and the performance of handover traffic.
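    A minimal sketch of an RL-based admission controller in the spirit of the abstract, assuming a coarse state of (bandwidth bucket, traffic class), an admit/reject action, and illustrative utilities that penalize dropping handover traffic; the actual utility design and cross-layer mechanism of the paper are not reproduced.

```python
import numpy as np

# State: (available-bandwidth bucket, traffic class); actions: 0 = reject, 1 = admit.
N_BW_BUCKETS, N_CLASSES, N_ACTIONS = 10, 2, 2   # classes: 0 = new call, 1 = handover
Q = np.zeros((N_BW_BUCKETS, N_CLASSES, N_ACTIONS))

def admit(bw_bucket, traffic_class, eps=0.1, rng=None):
    """epsilon-greedy admission decision for the incoming request."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(Q[bw_bucket, traffic_class].argmax())

def learn(s, a, reward, s_next, alpha=0.1, gamma=0.95):
    """Q-learning update; the reward encodes the long-term utility of the
    decision (e.g. a large penalty for dropping a handover request)."""
    Q[s][a] += alpha * (reward + gamma * Q[s_next].max() - Q[s][a])

rng = np.random.default_rng(0)
s = (3, 1)                        # little bandwidth left, handover request
a = admit(*s, rng=rng)
reward = 1.0 if a == 1 else -5.0  # illustrative utilities, not from the paper
learn(s, a, reward, s_next=(2, 0))
print("decision:", "admit" if a == 1 else "reject")
```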

  3. Performance Comparison of Two Reinforcement Learning Algorithms for Small Mobile Robots

    Czech Academy of Sciences Publication Activity Database

    Neruda, Roman; Slušný, Stanislav

    2009-01-01

    Roč. 2, č. 1 (2009), s. 59-68 ISSN 2005-4297 R&D Projects: GA MŠk(CZ) 1M0567 Grant - others:GA UK(CZ) 7637/2007 Institutional research plan: CEZ:AV0Z10300504 Keywords: reinforcement learning * mobile robots * intelligent agents Subject RIV: IN - Informatics, Computer Science http://www.sersc.org/journals/IJCA/vol2_no1/7.pdf

  4. Design issues of a reinforcement-based self-learning fuzzy controller for petrochemical process control

    Science.gov (United States)

    Yen, John; Wang, Haojin; Daugherity, Walter C.

    1992-01-01

    Fuzzy logic controllers have some often-cited advantages over conventional techniques such as PID control, including easier implementation, accommodation to natural language, and the ability to cover a wider range of operating conditions. One major obstacle that hinders the broader application of fuzzy logic controllers is the lack of a systematic way to develop and modify their rules; as a result, the creation and modification of fuzzy rules often depends on trial and error or pure experimentation. One of the proposed approaches to address this issue is a self-learning fuzzy logic controller (SFLC) that uses reinforcement learning techniques to learn the desirability of states and to adjust the consequent part of its fuzzy control rules accordingly. Due to the different dynamics of the controlled processes, the performance of a self-learning fuzzy controller is highly contingent on its design. The design issue has not received sufficient attention. The issues related to the design of an SFLC for application to a petrochemical process are discussed, and its performance is compared with that of a PID and a self-tuning fuzzy logic controller.

  5. IMPLEMENTATION OF MULTIAGENT REINFORCEMENT LEARNING MECHANISM FOR OPTIMAL ISLANDING OPERATION OF DISTRIBUTION NETWORK

    DEFF Research Database (Denmark)

    Saleem, Arshad; Lind, Morten

    2008-01-01

    among electric power utilities to utilize modern information and communication technologies (ICT) in order to improve the automation of the distribution system. In this paper we present our work for the implementation of a dynamic multi-agent based distributed reinforcement learning mechanism...

  6. Pedunculopontine tegmental nucleus lesions impair stimulus-reward learning in autoshaping and conditioned reinforcement paradigms.

    Science.gov (United States)

    Inglis, W L; Olmstead, M C; Robbins, T W

    2000-04-01

    The role of the pedunculopontine tegmental nucleus (PPTg) in stimulus-reward learning was assessed by testing the effects of PPTg lesions on performance in visual autoshaping and conditioned reinforcement (CRf) paradigms. Rats with PPTg lesions were unable to learn an association between a conditioned stimulus (CS) and a primary reward in either paradigm. In the autoshaping experiment, PPTg-lesioned rats approached the CS+ and CS- with equal frequency, and the latencies to respond to the two stimuli did not differ. PPTg lesions also disrupted discriminated approaches to an appetitive CS in the CRf paradigm and completely abolished the acquisition of responding with CRf. These data are discussed in the context of a possible cognitive function of the PPTg, particularly in terms of lesion-induced disruptions of attentional processes that are mediated by the thalamus.

  7. Neural systems underlying aversive conditioning in humans with primary and secondary reinforcers

    Directory of Open Access Journals (Sweden)

    Mauricio R Delgado

    2011-05-01

    Money is a secondary reinforcer commonly used across a range of disciplines in experimental paradigms investigating reward learning and decision-making. The effectiveness of monetary reinforcers during aversive learning and its neural basis, however, remains a topic of debate. Specifically, it is unclear if the initial acquisition of aversive representations of monetary losses depends on similar neural systems as more traditional aversive conditioning that involves primary reinforcers. This study contrasts the efficacy of a biologically defined primary reinforcer (shock) and a socially defined secondary reinforcer (money) during aversive learning and its associated neural circuitry. During a two-part experiment, participants first played a gambling game where wins and losses were based on performance to gain an experimental bank. Participants were then exposed to two separate aversive conditioning sessions. In one session, a primary reinforcer (mild shock) served as an unconditioned stimulus (US) and was paired with one of two colored squares, the conditioned stimuli (CS+ and CS-, respectively). In another session, a secondary reinforcer (loss of money) served as the US and was paired with one of two different CS. Skin conductance responses were greater for CS+ compared to CS- trials irrespective of type of reinforcer. Neuroimaging results revealed that the striatum, a region typically linked with reward-related processing, was found to be involved in the acquisition of aversive conditioned response irrespective of reinforcer type. In contrast, the amygdala was involved during aversive conditioning with primary reinforcers, as suggested by both an exploratory fMRI analysis and a follow-up case study with a patient with bilateral amygdala damage. Taken together, these results suggest that learning about potential monetary losses may depend on reinforcement learning related systems, rather than on typical structures involved in more biologically based

  8. Accelerating Multiagent Reinforcement Learning by Equilibrium Transfer.

    Science.gov (United States)

    Hu, Yujing; Gao, Yang; An, Bo

    2015-07-01

    An important approach in multiagent reinforcement learning (MARL) is equilibrium-based MARL, which adopts equilibrium solution concepts in game theory and requires agents to play equilibrium strategies at each state. However, most existing equilibrium-based MARL algorithms cannot scale due to a large number of computationally expensive equilibrium computations (e.g., computing Nash equilibria is PPAD-hard) during learning. For the first time, this paper finds that during the learning process of equilibrium-based MARL, the one-shot games corresponding to each state's successive visits often have the same or similar equilibria (for some states more than 90% of games corresponding to successive visits have similar equilibria). Inspired by this observation, this paper proposes to use equilibrium transfer to accelerate equilibrium-based MARL. The key idea of equilibrium transfer is to reuse previously computed equilibria when each agent has a small incentive to deviate. By introducing transfer loss and transfer condition, a novel framework called equilibrium transfer-based MARL is proposed. We prove that although equilibrium transfer brings transfer loss, equilibrium-based MARL algorithms can still converge to an equilibrium policy under certain assumptions. Experimental results in widely used benchmarks (e.g., grid world game, soccer game, and wall game) show that the proposed framework: 1) not only significantly accelerates equilibrium-based MARL (up to 96.7% reduction in learning time), but also achieves higher average rewards than algorithms without equilibrium transfer and 2) scales significantly better than algorithms without equilibrium transfer when the state/action space grows and the number of agents increases.
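    A minimal sketch of the transfer condition for a two-player one-shot game, assuming the stored equilibrium is reused whenever no agent can gain more than a small epsilon by unilateral deviation; the payoff matrices and the epsilon value are illustrative, and the paper's transfer-loss analysis is not reproduced.

```python
import numpy as np

def deviation_incentive(payoffs, strategies):
    """Largest gain any single agent could obtain by unilaterally deviating
    from the stored joint (mixed) strategy in a two-player one-shot game.
    payoffs: pair of matrices indexed [action of agent 0, action of agent 1]."""
    p0, p1 = strategies
    expected0 = p0 @ payoffs[0] @ p1                 # agent 0's current payoff
    gain0 = (payoffs[0] @ p1).max() - expected0      # best unilateral deviation
    expected1 = p0 @ payoffs[1] @ p1
    gain1 = (p0 @ payoffs[1]).max() - expected1
    return max(gain0, gain1)

def maybe_reuse_equilibrium(payoffs, stored, eps=0.05):
    """Equilibrium-transfer test: reuse the stored equilibrium if no agent can
    gain more than eps by deviating; otherwise signal that the expensive
    equilibrium computation is needed."""
    if stored is not None and deviation_incentive(payoffs, stored) <= eps:
        return stored, True
    return None, False

# Toy example: a coordination game whose payoffs drift slightly between visits.
payoffs = (np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([[1.0, 0.0], [0.0, 1.0]]))
stored = (np.array([1.0, 0.0]), np.array([1.0, 0.0]))    # both play action 0
drifted = (payoffs[0] + 0.01, payoffs[1] - 0.01)
print("reuse stored equilibrium:", maybe_reuse_equilibrium(drifted, stored)[1])
```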

  9. Exploring Temporal Sequences of Regulatory Phases and Associated Interactions in Low- and High-Challenge Collaborative Learning Sessions

    Science.gov (United States)

    Sobocinski, Márta; Malmberg, Jonna; Järvelä, Sanna

    2017-01-01

    Investigating the temporal order of regulatory processes can explain in more detail the mechanisms behind success or lack of success during collaborative learning. The aim of this study is to explore the differences between high- and low-challenge collaborative learning sessions. This is achieved through examining how the three phases of…

  10. Reinforcement learning produces dominant strategies for the Iterated Prisoner's Dilemma.

    Science.gov (United States)

    Harper, Marc; Knight, Vincent; Jones, Martin; Koutsovoulos, Georgios; Glynatsi, Nikoleta E; Campbell, Owen

    2017-01-01

    We present tournament results and several powerful strategies for the Iterated Prisoner's Dilemma created using reinforcement learning techniques (evolutionary and particle swarm algorithms). These strategies are trained to perform well against a corpus of over 170 distinct opponents, including many well-known and classic strategies. All the trained strategies win standard tournaments against the total collection of other opponents. The trained strategies, together with one particular human-designed strategy, are also the top performers in noisy tournaments.

  11. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    Directory of Open Access Journals (Sweden)

    George L Chadderdon

    Full Text Available Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.

  12. Reinforcement learning of targeted movement in a spiking neuronal model of motor cortex.

    Science.gov (United States)

    Chadderdon, George L; Neymotin, Samuel A; Kerr, Cliff C; Lytton, William W

    2012-01-01

    Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the 2 muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (-1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
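
    The learning rule sketched below is an illustrative rendering (not the authors' exact model) of the mechanism the abstract describes: per-synapse eligibility traces tag recently coactive input-output pairs, and a global three-valued signal (+1, 0, -1) gates the weight change. The trace dynamics, rates, and the reduction of spike-timing dependence to simple coactivity are assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_out = 16, 8
        w = rng.uniform(0.0, 1.0, size=(n_in, n_out))   # feedforward weights
        elig = np.zeros_like(w)                         # per-synapse eligibility traces
        TAU_ELIG, LR, DT = 0.1, 0.01, 0.001             # illustrative constants (s, -, s)

        def step(pre_spikes, post_spikes, reward):
            """One update: decay the traces, tag coactive synapses, then apply
            the global reward (+1), no-learning (0), or punishment (-1) signal."""
            global w, elig
            elig *= np.exp(-DT / TAU_ELIG)
            elig += np.outer(pre_spikes.astype(float), post_spikes.astype(float))
            w += LR * reward * elig                     # dopamine-like global gating
            np.clip(w, 0.0, 1.0, out=w)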

  13. Intrinsically motivated reinforcement learning for human-robot interaction in the real-world.

    Science.gov (United States)

    Qureshi, Ahmed Hussain; Nakamura, Yutaka; Yoshikawa, Yuichiro; Ishiguro, Hiroshi

    2018-03-26

    For a natural social human-robot interaction, it is essential for a robot to learn the human-like social skills. However, learning such skills is notoriously hard due to the limited availability of direct instructions from people to teach a robot. In this paper, we propose an intrinsically motivated reinforcement learning framework in which an agent gets the intrinsic motivation-based rewards through the action-conditional predictive model. By using the proposed method, the robot learned the social skills from the human-robot interaction experiences gathered in the real uncontrolled environments. The results indicate that the robot not only acquired human-like social skills but also took more human-like decisions, on a test dataset, than a robot which received direct rewards for the task achievement. Copyright © 2018 Elsevier Ltd. All rights reserved.

  14. Engagement in Classroom Learning: Creating Temporal Participation Incentives for Extrinsically Motivated Students through Bonus Credits

    Science.gov (United States)

    Rassuli, Ali

    2012-01-01

    Extrinsic inducements to adjust students' learning motivations have evolved within 2 opposing paradigms. Cognitive evaluation theories claim that controlling factors embedded in extrinsic rewards dissipate intrinsic aspirations. Behavioral theorists contend that if engagement is voluntary, extrinsic reinforcements enhance learning without ill…

  15. Investigations of timing during the schedule and reinforcement intervals with wheel-running reinforcement.

    Science.gov (United States)

    Belke, Terry W; Christie-Fougere, Melissa M

    2006-11-01

    Across two experiments, a peak procedure was used to assess the timing of the onset and offset of an opportunity to run as a reinforcer. The first experiment investigated the effect of reinforcer duration on temporal discrimination of the onset of the reinforcement interval. Three male Wistar rats were exposed to fixed-interval (FI) 30-s schedules of wheel-running reinforcement and the duration of the opportunity to run was varied across values of 15, 30, and 60s. Each session consisted of 50 reinforcers and 10 probe trials. Results showed that as reinforcer duration increased, the percentage of postreinforcement pauses longer than the 30-s schedule interval increased. On probe trials, peak response rates occurred near the time of reinforcer delivery and peak times varied with reinforcer duration. In a second experiment, seven female Long-Evans rats were exposed to FI 30-s schedules leading to 30-s opportunities to run. Timing of the onset and offset of the reinforcement period was assessed by probe trials during the schedule interval and during the reinforcement interval in separate conditions. The results provided evidence of timing of the onset, but not the offset of the wheel-running reinforcement period. Further research is required to assess if timing occurs during a wheel-running reinforcement period.

  16. Reinforcement learning produces dominant strategies for the Iterated Prisoner's Dilemma.

    Directory of Open Access Journals (Sweden)

    Marc Harper

    Full Text Available We present tournament results and several powerful strategies for the Iterated Prisoner's Dilemma created using reinforcement learning techniques (evolutionary and particle swarm algorithms). These strategies are trained to perform well against a corpus of over 170 distinct opponents, including many well-known and classic strategies. All the trained strategies win standard tournaments against the total collection of other opponents. The trained strategies, together with one particular human-designed strategy, are also the top performers in noisy tournaments.

  17. Memory Transformation Enhances Reinforcement Learning in Dynamic Environments.

    Science.gov (United States)

    Santoro, Adam; Frankland, Paul W; Richards, Blake A

    2016-11-30

    Over the course of systems consolidation, there is a switch from a reliance on detailed episodic memories to generalized schematic memories. This switch is sometimes referred to as "memory transformation." Here we demonstrate a previously unappreciated benefit of memory transformation, namely, its ability to enhance reinforcement learning in a dynamic environment. We developed a neural network that is trained to find rewards in a foraging task where reward locations are continuously changing. The network can use memories for specific locations (episodic memories) and statistical patterns of locations (schematic memories) to guide its search. We find that switching from an episodic to a schematic strategy over time leads to enhanced performance due to the tendency for the reward location to be highly correlated with itself in the short-term, but regress to a stable distribution in the long-term. We also show that the statistics of the environment determine the optimal utilization of both types of memory. Our work recasts the theoretical question of why memory transformation occurs, shifting the focus from the avoidance of memory interference toward the enhancement of reinforcement learning across multiple timescales. As time passes, memories transform from a highly detailed state to a more gist-like state, in a process called "memory transformation." Theories of memory transformation speak to its advantages in terms of reducing memory interference, increasing memory robustness, and building models of the environment. However, the role of memory transformation from the perspective of an agent that continuously acts and receives reward in its environment is not well explored. In this work, we demonstrate a view of memory transformation that defines it as a way of optimizing behavior across multiple timescales. Copyright © 2016 the authors 0270-6474/16/3612228-15$15.00/0.

  18. A Day-to-Day Route Choice Model Based on Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Fangfang Wei

    2014-01-01

    Full Text Available Day-to-day traffic dynamics are generated by individual travelers' route choice and route adjustment behaviors, which are well suited to study with agent-based models and learning theory. In this paper, we propose a day-to-day route choice model based on reinforcement learning and multiagent simulation; travelers' memory, learning rate, and experience cognition are taken into account. The model is then verified and analyzed. Results show that the network flow can converge to user equilibrium (UE) if travelers can remember all the travel times they have experienced, but this is not necessarily the case under limited memory; the learning rate strengthens flow fluctuation, whereas memory has the opposite effect; moreover, a high learning rate results in cyclical oscillation during the flow evolution process. Finally, both link capacity degradation and random link capacity scenarios are used to illustrate the model's applications. These analyses and applications demonstrate that the model is reasonable and useful for studying day-to-day traffic dynamics.
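
    A minimal sketch of one such traveler agent (illustrative only; the exact update and parameters are not given in the abstract): perceived route costs are updated with a learning rate toward the average of a bounded memory of experienced travel times, and the route with the lowest perceived cost is chosen, with occasional exploration.

        import random

        class Traveler:
            """Day-to-day route-choice agent with limited memory and a learning-rate
            update of perceived travel times (illustrative sketch)."""
            def __init__(self, n_routes, learning_rate=0.3, memory=10):
                self.perceived = [0.0] * n_routes
                self.history = {r: [] for r in range(n_routes)}
                self.alpha, self.memory = learning_rate, memory

            def choose(self, epsilon=0.1):
                if random.random() < epsilon:                  # occasional exploration
                    return random.randrange(len(self.perceived))
                return min(range(len(self.perceived)), key=lambda r: self.perceived[r])

            def update(self, route, experienced_time):
                hist = self.history[route]
                hist.append(experienced_time)
                if len(hist) > self.memory:                    # limited memory
                    hist.pop(0)
                target = sum(hist) / len(hist)
                self.perceived[route] += self.alpha * (target - self.perceived[route])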

  19. Reinforcement Learning Based Data Self-Destruction Scheme for Secured Data Management

    Directory of Open Access Journals (Sweden)

    Young Ki Kim

    2018-04-01

    Full Text Available As technologies and services that leverage cloud computing have evolved, the number of businesses and individuals who use them is increasing rapidly. In the course of using cloud services, as users store and use data that include personal information, research on privacy protection models to protect sensitive information in the cloud environment is becoming more important. As a solution to this problem, a self-destructing scheme has been proposed that prevents the decryption of encrypted user data after a certain period of time using a Distributed Hash Table (DHT) network. However, the existing self-destructing scheme does not specify how to set the number of key shares and the threshold value for a dynamic DHT network environment. This paper proposes a method to set the parameters used to generate the key shares needed for the self-destructing scheme, considering the availability and security of data. The proposed method defines the state, action, and reward of the reinforcement learning model based on the similarity of the graph, and applies the self-destructing scheme process by updating the parameters based on the reinforcement learning model. Through the proposed technique, key sharing parameters can be set in consideration of data availability and security in dynamic DHT network environments.

  20. Bio-robots automatic navigation with graded electric reward stimulation based on Reinforcement Learning.

    Science.gov (United States)

    Zhang, Chen; Sun, Chao; Gao, Liqiang; Zheng, Nenggan; Chen, Weidong; Zheng, Xiaoxiang

    2013-01-01

    Bio-robots based on brain-computer interfaces (BCI) suffer from a lack of consideration of the animal's own characteristics in navigation. This paper proposes a new method for bio-robots' automatic navigation that combines a reward-generating algorithm based on Reinforcement Learning (RL) with the learning intelligence of the animal itself. Given graded electrical rewards, the animal, e.g., the rat, seeks the maximum reward while exploring an unknown environment. Since the rat has excellent spatial recognition, the rat-robot and the RL algorithm can converge to an optimal route through co-learning. This work provides significant inspiration for the practical development of bio-robot navigation with hybrid intelligence.

  1. Spike-Based Bayesian-Hebbian Learning of Temporal Sequences

    DEFF Research Database (Denmark)

    Tully, Philip J; Lindén, Henrik; Hennig, Matthias H

    2016-01-01

    Many cognitive and motor functions are enabled by the temporal representation and processing of stimuli, but it remains an open issue how neocortical microcircuits can reliably encode and replay such sequences of information. To better understand this, a modular attractor memory network is proposed … in which meta-stable sequential attractor transitions are learned through changes to synaptic weights and intrinsic excitabilities via the spike-based Bayesian Confidence Propagation Neural Network (BCPNN) learning rule. We find that the formation of distributed memories, embodied by increased periods…

  2. Effect of steel reinforcement with different degree of corrosion on degeneration of mechanical performance of reinforced concrete frame joints

    Directory of Open Access Journals (Sweden)

    Wu Xu

    2016-02-01

    Full Text Available Beam-column joints, which bear the high shear forces that maintain equilibrium between beam ends and column ends, are the major components influencing the performance of the whole frame. Post-earthquake investigations suggest that collapse of frame structures is in most cases induced by joint failure. Beam-column joints must therefore have high bearing capacity and good ductility, and reinforced concrete construction meets this requirement; however, corrosion over the long service life of a reinforced concrete frame leads to degeneration of the mechanical performance of the joints. To establish how steel reinforcement with different corrosion rates affects the degeneration of the bearing capacity of reinforced concrete frame joints, this study carried out a nonlinear numerical analysis of fifteen models without stirrups in the joint core area, using a displacement-based method and accounting for the column-end axial load ratio and constraint conditions. The aim is to identify the key factors that influence the mechanical performance of joints and thereby provide a basis for the repair and strengthening of degenerated frame joints.

  3. The Impact of Students' Temporal Perspectives on Time-on-Task and Learning Performance in Game Based Learning

    Science.gov (United States)

    Romero, Margarida; Usart, Mireia

    2013-01-01

    The use of games for educational purposes has been considered as a learning methodology that attracts the students' attention and may allow focusing individuals on the learning activity through the [serious games] SG game dynamic. Based on the hypothesis that students' Temporal Perspective has an impact on learning performance and time-on-task,…

  4. A Reinforcement Learning Model Equipped with Sensors for Generating Perception Patterns: Implementation of a Simulated Air Navigation System Using ADS-B (Automatic Dependent Surveillance-Broadcast) Technology.

    Science.gov (United States)

    Álvarez de Toledo, Santiago; Anguera, Aurea; Barreiro, José M; Lara, Juan A; Lizcano, David

    2017-01-19

    Over the last few decades, a number of reinforcement learning techniques have emerged, and different reinforcement learning-based applications have proliferated. However, such techniques tend to specialize in a particular field. This is an obstacle to their generalization and extrapolation to other areas. Besides, neither the reward-punishment (r-p) learning process nor the convergence of results is fast and efficient enough. To address these obstacles, this research proposes a general reinforcement learning model. This model is independent of input and output types and based on general bioinspired principles that help to speed up the learning process. The model is composed of a perception module based on sensors whose specific perceptions are mapped as perception patterns. In this manner, similar perceptions (even if perceived at different positions in the environment) are accounted for by the same perception pattern. Additionally, the model includes a procedure that statistically associates perception-action pattern pairs depending on the positive or negative results output by executing the respective action in response to a particular perception during the learning process. To do this, the model is fitted with a mechanism that reacts positively or negatively to particular sensory stimuli in order to rate results. The model is supplemented by an action module that can be configured depending on the maneuverability of each specific agent. The model has been applied in the air navigation domain, a field with strong safety restrictions, which led us to implement a simulated system equipped with the proposed model. Accordingly, the perception sensors were based on Automatic Dependent Surveillance-Broadcast (ADS-B) technology, which is described in this paper. The results were quite satisfactory, and it outperformed traditional methods existing in the literature with respect to learning reliability and efficiency.
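
    As a rough illustration of the perception-pattern idea (the names, the quantization step, and the scoring rule are assumptions, not the paper's specification), similar sensor readings can be mapped to one pattern key, and each (pattern, action) pair can accumulate a score from positive or negative outcomes.

        from collections import defaultdict
        import random

        class PerceptionActionTable:
            """Maps quantized sensor readings to perception patterns and statistically
            associates each (pattern, action) pair with its observed outcomes."""
            def __init__(self, actions, resolution=1.0):
                self.actions, self.resolution = actions, resolution
                self.score = defaultdict(float)

            def pattern(self, readings):
                # Nearby perceptions share a pattern key via simple quantization.
                return tuple(round(v / self.resolution) for v in readings)

            def act(self, readings, epsilon=0.1):
                key = self.pattern(readings)
                if random.random() < epsilon:
                    return random.choice(self.actions)
                return max(self.actions, key=lambda a: self.score[(key, a)])

            def learn(self, readings, action, outcome):
                # outcome: +1 for a positive result, -1 for a negative one.
                self.score[(self.pattern(readings), action)] += outcome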

  5. Using crowdsourcing to compare temporal, social temporal, and probability discounting among obese and non-obese individuals.

    Science.gov (United States)

    Bickel, Warren K; George Wilson, A; Franck, Christopher T; Terry Mueller, E; Jarmolowicz, David P; Koffarnus, Mikhail N; Fede, Samantha J

    2014-04-01

    Previous research comparing obese and non-obese samples on the delayed discounting procedure has produced mixed results. The aim of the current study was to clarify these discrepant findings by comparing a variety of temporal discounting measures in a large sample of internet users (n=1163) obtained from a crowdsourcing service, Amazon Mechanical Turk (AMT). Measures of temporal, social-temporal (a combination of standard and social temporal), and probability discounting were obtained. Significant differences were obtained on all discounting measures except probability discounting, but the obtained effect sizes were small. These data suggest that larger-N studies will be more likely to detect differences between obese and non-obese samples, and may afford the opportunity, in future studies, to decompose a large obese sample into different subgroups to examine the effect of other relevant measures, such as the reinforcing value of food, on discounting. Copyright © 2013 Elsevier Ltd. All rights reserved.
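
    For context, temporal discounting data of this kind are commonly summarized with a one-parameter hyperbolic model (the specific model fit in this study is not stated in the abstract):

        def hyperbolic_value(amount, delay, k):
            """V = A / (1 + k*D): subjective value of a delayed amount under the
            common hyperbolic model; a larger k means steeper discounting."""
            return amount / (1.0 + k * delay)

        # Example: $100 delayed by 30 days for two hypothetical discount rates.
        print(hyperbolic_value(100, 30, 0.01))   # ~76.9
        print(hyperbolic_value(100, 30, 0.05))   # 40.0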

  6. Variability in Dopamine Genes Dissociates Model-Based and Model-Free Reinforcement Learning.

    Science.gov (United States)

    Doll, Bradley B; Bath, Kevin G; Daw, Nathaniel D; Frank, Michael J

    2016-01-27

    Considerable evidence suggests that multiple learning systems can drive behavior. Choice can proceed reflexively from previous actions and their associated outcomes, as captured by "model-free" learning algorithms, or flexibly from prospective consideration of outcomes that might occur, as captured by "model-based" learning algorithms. However, differential contributions of dopamine to these systems are poorly understood. Dopamine is widely thought to support model-free learning by modulating plasticity in striatum. Model-based learning may also be affected by these striatal effects, or by other dopaminergic effects elsewhere, notably on prefrontal working memory function. Indeed, prominent demonstrations linking striatal dopamine to putatively model-free learning did not rule out model-based effects, whereas other studies have reported dopaminergic modulation of verifiably model-based learning, but without distinguishing a prefrontal versus striatal locus. To clarify the relationships between dopamine, neural systems, and learning strategies, we combine a genetic association approach in humans with two well-studied reinforcement learning tasks: one isolating model-based from model-free behavior and the other sensitive to key aspects of striatal plasticity. Prefrontal function was indexed by a polymorphism in the COMT gene, differences of which reflect dopamine levels in the prefrontal cortex. This polymorphism has been associated with differences in prefrontal activity and working memory. Striatal function was indexed by a gene coding for DARPP-32, which is densely expressed in the striatum where it is necessary for synaptic plasticity. We found evidence for our hypothesis that variations in prefrontal dopamine relate to model-based learning, whereas variations in striatal dopamine function relate to model-free learning. Decisions can stem reflexively from their previously associated outcomes or flexibly from deliberative consideration of potential choice outcomes
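
    A common way to formalize the two strategies' contributions to choice (a generic formulation, not necessarily the exact model fit in this study) is to mix model-based and model-free action values with a weight w and pass the result through a softmax:

        import numpy as np

        def combined_action_values(q_mb, q_mf, w):
            """Weighted mixture of model-based and model-free value estimates;
            w near 1 means choices are dominated by the model-based system."""
            return w * q_mb + (1.0 - w) * q_mf

        def softmax_choice_probs(q, beta):
            """Choice probabilities from action values with inverse temperature beta."""
            z = beta * (q - q.max())
            e = np.exp(z)
            return e / e.sum()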

  7. Curiosity driven reinforcement learning for motion planning on humanoids

    Science.gov (United States)

    Frank, Mikhail; Leitner, Jürgen; Stollenga, Marijn; Förster, Alexander; Schmidhuber, Jürgen

    2014-01-01

    Most previous work on artificial curiosity (AC) and intrinsic motivation focuses on basic concepts and theory. Experimental results are generally limited to toy scenarios, such as navigation in a simulated maze, or control of a simple mechanical system with one or two degrees of freedom. To study AC in a more realistic setting, we embody a curious agent in the complex iCub humanoid robot. Our novel reinforcement learning (RL) framework consists of a state-of-the-art, low-level, reactive control layer, which controls the iCub while respecting constraints, and a high-level curious agent, which explores the iCub's state-action space through information gain maximization, learning a world model from experience, controlling the actual iCub hardware in real-time. To the best of our knowledge, this is the first ever embodied, curious agent for real-time motion planning on a humanoid. We demonstrate that it can learn compact Markov models to represent large regions of the iCub's configuration space, and that the iCub explores intelligently, showing interest in its physical constraints as well as in objects it finds in its environment. PMID:24432001
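
    The intrinsic reward can be caricatured as follows (an illustrative stand-in, not the authors' formulation; the forward_model interface is hypothetical): reward the agent for experiences that improve its learned world model, here measured as the reduction in prediction error after updating on the new observation.

        import numpy as np

        class CuriosityReward:
            """Information-gain proxy: intrinsic reward equals the drop in the
            forward model's prediction error produced by the new experience."""
            def __init__(self, forward_model):
                self.model = forward_model                 # predicts next state from (s, a)

            def __call__(self, state, action, next_state):
                err_before = np.linalg.norm(self.model.predict(state, action) - next_state)
                self.model.update(state, action, next_state)
                err_after = np.linalg.norm(self.model.predict(state, action) - next_state)
                return max(err_before - err_after, 0.0)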

  8. Tunnel Ventilation Control Using Reinforcement Learning Methodology

    Science.gov (United States)

    Chu, Baeksuk; Kim, Dongnam; Hong, Daehie; Park, Jooyoung; Chung, Jin Taek; Kim, Tae-Hyung

    The main purpose of a tunnel ventilation system is to maintain CO pollutant concentration and VI (visibility index) at an adequate level to provide drivers with a comfortable and safe driving environment, while also minimizing the power consumed to operate the ventilation system. To achieve these objectives, the control algorithm used in this research is a reinforcement learning (RL) method. RL is a goal-directed learning of a mapping from situations to actions without relying on exemplary supervision or complete models of the environment. The goal of RL is to maximize a reward, which is an evaluative feedback from the environment. In constructing the reward for the tunnel ventilation system, the two objectives listed above are included, that is, maintaining an adequate level of pollutants and minimizing power consumption. An RL algorithm based on an actor-critic architecture and a gradient-following algorithm is applied to the tunnel ventilation system. Simulation results obtained with real data collected from an existing tunnel ventilation system, together with real experimental verification, are provided in this paper. It is confirmed that with the suggested controller the pollutant level inside the tunnel was well maintained under the allowable limit, and energy consumption was improved compared to the conventional control scheme.
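
    The reward described above might be composed roughly as follows (the thresholds, weights, and sign conventions are illustrative assumptions, not the values used in the study): penalize pollutant-limit violations and poor visibility, and subtract a smaller term proportional to fan power.

        def ventilation_reward(co_ppm, visibility, power_kw,
                               co_limit=25.0, vis_floor=0.5, energy_weight=0.01):
            """Illustrative reward: higher visibility is assumed to be better, and
            CO above co_limit or visibility below vis_floor is penalized."""
            reward = 0.0
            reward -= max(co_ppm - co_limit, 0.0)               # CO limit violation
            reward -= 100.0 * max(vis_floor - visibility, 0.0)  # visibility shortfall
            reward -= energy_weight * power_kw                  # energy cost term
            return reward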

  9. Rats bred for helplessness exhibit positive reinforcement learning deficits which are not alleviated by an antidepressant dose of the MAO-B inhibitor deprenyl.

    Science.gov (United States)

    Schulz, Daniela; Henn, Fritz A; Petri, David; Huston, Joseph P

    2016-08-04

    Principles of negative reinforcement learning may play a critical role in the etiology and treatment of depression. We examined the integrity of positive reinforcement learning in congenitally helpless (cH) rats, an animal model of depression, using a random ratio schedule and a devaluation-extinction procedure. Furthermore, we tested whether an antidepressant dose of the monoamine oxidase (MAO)-B inhibitor deprenyl would reverse any deficits in positive reinforcement learning. We found that cH rats (n=9) were impaired in the acquisition of even simple operant contingencies, such as a fixed interval (FI) 20 schedule. cH rats exhibited no apparent deficits in appetite or reward sensitivity. They reacted to the devaluation of food in a manner consistent with a dose-response relationship. Reinforcer motivation as assessed by lever pressing across sessions with progressively decreasing reward probabilities was highest in congenitally non-helpless (cNH, n=10) rats as long as the reward probabilities remained relatively high. cNH compared to wild-type (n=10) rats were also more resistant to extinction across sessions. Compared to saline (n=5), deprenyl (n=5) reduced the duration of immobility of cH rats in the forced swimming test, indicative of antidepressant effects, but did not restore any deficits in the acquisition of a FI 20 schedule. We conclude that positive reinforcement learning was impaired in rats bred for helplessness, possibly due to motivational impairments but not deficits in reward sensitivity, and that deprenyl exerted antidepressant effects but did not reverse the deficits in positive reinforcement learning. Copyright © 2016 IBRO. Published by Elsevier Ltd. All rights reserved.

  10. Fuzzy OLAP association rules mining-based modular reinforcement learning approach for multiagent systems.

    Science.gov (United States)

    Kaya, Mehmet; Alhajj, Reda

    2005-04-01

    Multiagent systems and data mining have recently attracted considerable attention in the field of computing. Reinforcement learning is the most commonly used learning process for multiagent systems. However, it still has some drawbacks, including modeling other learning agents present in the domain as part of the state of the environment, and some states are experienced much less than others, or some state-action pairs are never visited during the learning phase. Further, before completing the learning process, an agent cannot exhibit a certain behavior in some states that may be experienced sufficiently. In this study, we propose a novel multiagent learning approach to handle these problems. Our approach is based on utilizing the mining process for modular cooperative learning systems. It incorporates fuzziness and online analytical processing (OLAP) based mining to effectively process the information reported by agents. First, we describe a fuzzy data cube OLAP architecture which facilitates effective storage and processing of the state information reported by agents. This way, the action of the other agent, even one outside the visual environment of the agent under consideration, can simply be predicted by extracting online association rules, a well-known data mining technique, from the constructed data cube. Second, we present a new action selection model, which is also based on association rules mining. Finally, we generalize insufficiently experienced states by mining multilevel association rules from the proposed fuzzy data cube. Experimental results obtained on two different versions of a well-known pursuit domain show the robustness and effectiveness of the proposed fuzzy OLAP mining based modular learning approach. Finally, we tested the scalability of the approach presented in this paper and compared it with our previous work on modular-fuzzy Q-learning and ordinary Q-learning.

  11. Vision-based Navigation and Reinforcement Learning Path Finding for Social Robots

    OpenAIRE

    Pérez Sala, Xavier

    2010-01-01

    We propose a robust system for automatic Robot Navigation in uncontrolled environments. The system is composed of three main modules: the Artificial Vision module, the Reinforcement Learning module, and the behavior control module. The aim of the system is to allow a robot to automatically find a path that arrives at a prefixed goal. Turn and straight movements in uncontrolled environments are automatically estimated and controlled using the proposed modules. The Artificial Vi...

  12. Rescaling of temporal expectations during extinction

    Science.gov (United States)

    Drew, Michael R.; Walsh, Carolyn; Balsam, Peter D

    2016-01-01

    Previous research suggests that extinction learning is temporally specific. Changing the CS duration between training and extinction can facilitate the loss of the CR within the extinction session but impairs long-term retention of extinction. In two experiments using conditioned magazine approach with rats, we examined the relation between temporal specificity of extinction and CR timing. In Experiment 1 rats were trained on a 12-s, fixed CS-US interval and then extinguished with CS presentations that were 6, 12, or 24 s in duration. The design of Experiment 2 was the same except rats were trained using partial rather than continuous reinforcement. In both experiments, extending the CS duration in extinction facilitated the diminution of CRs during the extinction session, but shortening the CS duration failed to slow extinction. In addition, extending (but not shortening) the CS duration caused temporal rescaling of the CR, in that the peak CR rate migrated later into the trial over the course of extinction training. This migration partially accounted for the faster loss of the CR when the CS duration was extended. Results are incompatible with the hypothesis that extinction is driven by cumulative CS exposure and suggest that temporally extended nonreinforced CS exposure reduces conditioned responding via temporal displacement rather than through extinction per se. PMID:28045291

  13. Comparing Different Classes of Reinforcement to Increase Expressive Language for Individuals with Autism

    Science.gov (United States)

    Leaf, Justin B.; Dale, Stephanie; Kassardjian, Alyne; Tsuji, Kathleen H.; Taubman, Mitchell; McEachin, John J.; Leaf, Ronald B.; Oppenheim-Leaf, Misty L.

    2014-01-01

    One of the basic principles of applied behavior analysis is that behavior change is largely due to that behavior being reinforced. Therefore the use of positive reinforcement is a key component of most behavioral programs for individuals diagnosed with autism. In this study we compared four different classes of reinforcers (i.e., food, praise,…

  14. Long term effects of aversive reinforcement on colour discrimination learning in free-flying bumblebees.

    Directory of Open Access Journals (Sweden)

    Miguel A Rodríguez-Gironés

    Full Text Available The results of behavioural experiments provide important information about the structure and information-processing abilities of the visual system. Nevertheless, if we want to infer from behavioural data how the visual system operates, it is important to know how different learning protocols affect performance and to devise protocols that minimise noise in the response of experimental subjects. The purpose of this work was to investigate how reinforcement schedule and individual variability affect the learning process in a colour discrimination task. Free-flying bumblebees were trained to discriminate between two perceptually similar colours. The target colour was associated with sucrose solution, and the distractor could be associated with water or quinine solution throughout the experiment, or with one substance during the first half of the experiment and the other during the second half. Both acquisition and final performance of the discrimination task (measured as proportion of correct choices were determined by the choice of reinforcer during the first half of the experiment: regardless of whether bees were trained with water or quinine during the second half of the experiment, bees trained with quinine during the first half learned the task faster and performed better during the whole experiment. Our results confirm that the choice of stimuli used during training affects the rate at which colour discrimination tasks are acquired and show that early contact with a strongly aversive stimulus can be sufficient to maintain high levels of attention during several hours. On the other hand, bees which took more time to decide on which flower to alight were more likely to make correct choices than bees which made fast decisions. This result supports the existence of a trade-off between foraging speed and accuracy, and highlights the importance of measuring choice latencies during behavioural experiments focusing on cognitive abilities.

  15. Reinforcement Learning Based Web Service Compositions for Mobile Business

    Science.gov (United States)

    Zhou, Juan; Chen, Shouming

    In this paper, we propose a new solution to Reactive Web Service Composition by modeling it with Reinforcement Learning and introducing modified (alterable) QoS variables into the model as elements of the Markov Decision Process tuple. Moreover, we give an example of Reactive-WSC-based mobile banking to demonstrate the intrinsic capability of the solution to obtain an optimized service composition, characterized by (alterable) target QoS variable sets with optimized values. Consequently, we conclude that the solution has considerable potential for boosting customer experience and quality of service in Web Services, and in applications across the whole electronic commerce and business sector.

  16. Global reinforcement training of CrossNets

    Science.gov (United States)

    Ma, Xiaolong

    2007-10-01

    Hybrid "CMOL" integrated circuits, incorporating advanced CMOS devices for neural cell bodies, nanowires as axons and dendrites, and latching switches as synapses, may be used for the hardware implementation of extremely dense (10^7 cells and 10^12 synapses per cm^2) neuromorphic networks, operating up to 10^6 times faster than their biological prototypes. We are exploring several "CrossNet" architectures that accommodate the limitations imposed by CMOL hardware and should allow effective training of the networks without direct external access to individual synapses. Our studies have shown that CrossNets based on simple (two-terminal) crosspoint devices can work well in at least two modes: as Hopfield networks for associative memory and as multilayer perceptrons for classification tasks. For more intelligent tasks (such as robot motion control or complex games), which do not have "examples" for supervised learning, more advanced training methods such as global reinforcement learning are necessary. For application of global reinforcement training algorithms to CrossNets, we have extended Williams's REINFORCE learning principle to a more general framework and derived several learning rules that are more suitable for CrossNet hardware implementation. The results of numerical experiments have shown that these new learning rules can work well for both classification tasks and reinforcement tasks such as the cartpole balancing control problem. Some limitations imposed by the CMOL hardware need to be carefully addressed for the successful application of in situ reinforcement training to CrossNets.
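
    A minimal sketch of the global-reinforcement idea in the REINFORCE family (a generic weight-perturbation variant, not the authors' derived CrossNet rules): perturb the weights with noise, evaluate the scalar reward, and move along the noise direction scaled by the reward relative to a baseline. Only a single global reward signal is needed, with no per-synapse error access.

        import numpy as np

        def reinforce_step(weights, evaluate, sigma=0.01, lr=0.1, baseline=0.0):
            """One weight-perturbation update driven by a global scalar reward."""
            noise = sigma * np.random.randn(*weights.shape)
            reward = evaluate(weights + noise)               # run the network, get reward
            weights += lr * (reward - baseline) * noise / sigma**2
            return weights, reward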

  17. Continuous theta-burst stimulation (cTBS) over the lateral prefrontal cortex alters reinforcement learning bias.

    Science.gov (United States)

    Ott, Derek V M; Ullsperger, Markus; Jocham, Gerhard; Neumann, Jane; Klein, Tilmann A

    2011-07-15

    The prefrontal cortex is known to play a key role in higher-order cognitive functions. Recently, we showed that this brain region is active in reinforcement learning, during which subjects constantly have to integrate trial outcomes in order to optimize performance. To further elucidate the role of the dorsolateral prefrontal cortex (DLPFC) in reinforcement learning, we applied continuous theta-burst stimulation (cTBS) either to the left or right DLPFC, or to the vertex as a control region, respectively, prior to the performance of a probabilistic learning task in an fMRI environment. While there was no influence of cTBS on learning performance per se, we observed a stimulation-dependent modulation of reward vs. punishment sensitivity: Left-hemispherical DLPFC stimulation led to a more reward-guided performance, while right-hemispherical cTBS induced a more avoidance-guided behavior. FMRI results showed enhanced prediction error coding in the ventral striatum in subjects stimulated over the left as compared to the right DLPFC. Both behavioral and imaging results are in line with recent findings that left, but not right-hemispherical stimulation can trigger a release of dopamine in the ventral striatum, which has been suggested to increase the relative impact of rewards rather than punishment on behavior. Copyright © 2011 Elsevier Inc. All rights reserved.

  18. Scheduled power tracking control of the wind-storage hybrid system based on the reinforcement learning theory

    Science.gov (United States)

    Li, Ze

    2017-09-01

    In view of the intermittency and uncertainty of wind power, energy storage and a wind generator are combined into a hybrid system to improve the controllability of the output power. A scheduled power tracking control method is proposed based on reinforcement learning theory and the Q-learning algorithm. In this method, the state space of the environment is formed from two key factors, i.e., the state of charge of the energy storage and the difference between the actual wind power and the scheduled power; the feasible action is the output power of the energy storage; and the corresponding immediate reward function is designed to reflect the rationality of the control action. By interacting with the environment and learning from the immediate reward, the optimal control strategy is gradually formed, and it can then be applied to the scheduled power tracking control of the hybrid system. Finally, the rationality and validity of the method are verified through simulation examples.
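
    The learning step described above reduces to a standard tabular Q-learning backup over the discretized (state of charge, power difference) state, with the storage output power as the action. The discretization, rates, and reward shaping below are illustrative assumptions in this sketch.

        import numpy as np
        import random

        N_SOC, N_DIFF, N_ACT = 10, 11, 7                     # illustrative discretization
        Q = np.zeros((N_SOC, N_DIFF, N_ACT))
        ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

        def choose_action(soc_bin, diff_bin):
            if random.random() < EPS:                        # exploration
                return random.randrange(N_ACT)
            return int(np.argmax(Q[soc_bin, diff_bin]))      # exploitation

        def update(soc_bin, diff_bin, action, reward, next_soc_bin, next_diff_bin):
            """Q-learning backup: reward reflects how well the hybrid output tracks
            the schedule while keeping the storage within operating limits."""
            best_next = np.max(Q[next_soc_bin, next_diff_bin])
            td_error = reward + GAMMA * best_next - Q[soc_bin, diff_bin, action]
            Q[soc_bin, diff_bin, action] += ALPHA * td_error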

  19. Reinforcement learning modulates the stability of cognitive control settings for object selection

    Directory of Open Access Journals (Sweden)

    Anthony William Sali

    2013-12-01

    Full Text Available Cognitive flexibility reflects both a trait that reliably differs between individuals and a state that can fluctuate moment-to-moment. Whether individuals can undergo persistent changes in cognitive flexibility as a result of reward learning is less understood. Here, we investigated whether reinforcing a periodic shift in an object selection strategy can make an individual more prone to switch strategies in a subsequent unrelated task. Participants completed two different choice tasks in which they selected one of four objects in an attempt to obtain a hidden reward on each trial. During a training phase, objects were defined by color. Participants received either consistent reward contingencies in which one color was more often rewarded, or contingencies in which the color that was more often rewarded changed periodically and without warning. Following the training phase, all participants completed a test phase in which reward contingencies were defined by spatial location and the location that was more often rewarded remained constant across the entire task. Those participants who received inconsistent contingencies during training continued to make more variable selections during the test phase in comparison to those who received the consistent training. Furthermore, a difference in the likelihood to switch selections on a trial-by-trial basis emerged between training groups: participants who received consistent contingencies during training were less likely to switch object selections following an unrewarded trial and more likely to repeat a selection following reward. Our findings provide evidence that the extent to which priority shifting is reinforced modulates the stability of cognitive control settings in a persistent manner, such that individuals become generally more or less prone to shifting priorities in the future.

  20. Sequential decisions: a computational comparison of observational and reinforcement accounts.

    Directory of Open Access Journals (Sweden)

    Nazanin Mohammadi Sepahvand

    Full Text Available Right-brain-damaged patients show impairments in sequential decision-making tasks for which healthy people show no difficulty. We hypothesized that this difficulty could be due to the failure of right-brain-damaged patients to develop well-matched models of the world. Our motivation is the idea that, to navigate uncertainty, humans use models of the world to direct the decisions they make when interacting with their environment. The better the model is, the better their decisions are. To explore the model building and updating process in humans and the basis for impairment after brain injury, we used a computational model of non-stationary sequence learning. RELPH (Reinforcement and Entropy Learned Pruned Hypothesis space) was able to qualitatively and quantitatively reproduce the results of left and right brain damaged patient groups and healthy controls playing a sequential version of Rock, Paper, Scissors. Our results suggest that, in general, humans employ a sub-optimal reinforcement-based learning method rather than an objectively better statistical learning approach, and that differences between right-brain-damaged and healthy control groups can be explained by different exploration policies, rather than qualitatively different learning mechanisms.

  1. From Creatures of Habit to Goal-Directed Learners: Tracking the Developmental Emergence of Model-Based Reinforcement Learning.

    Science.gov (United States)

    Decker, Johannes H; Otto, A Ross; Daw, Nathaniel D; Hartley, Catherine A

    2016-06-01

    Theoretical models distinguish two decision-making strategies that have been formalized in reinforcement-learning theory. A model-based strategy leverages a cognitive model of potential actions and their consequences to make goal-directed choices, whereas a model-free strategy evaluates actions based solely on their reward history. Research in adults has begun to elucidate the psychological mechanisms and neural substrates underlying these learning processes and factors that influence their relative recruitment. However, the developmental trajectory of these evaluative strategies has not been well characterized. In this study, children, adolescents, and adults performed a sequential reinforcement-learning task that enabled estimation of model-based and model-free contributions to choice. Whereas a model-free strategy was apparent in choice behavior across all age groups, a model-based strategy was absent in children, became evident in adolescents, and strengthened in adults. These results suggest that recruitment of model-based valuation systems represents a critical cognitive component underlying the gradual maturation of goal-directed behavior. © The Author(s) 2016.

  2. Reinforcement Learning for Routing in Cognitive Radio Ad Hoc Networks

    Directory of Open Access Journals (Sweden)

    Hasan A. A. Al-Rawi

    2014-01-01

    Full Text Available Cognitive radio (CR) enables unlicensed users (or secondary users, SUs) to sense for and exploit underutilized licensed spectrum owned by the licensed users (or primary users, PUs). Reinforcement learning (RL) is an artificial intelligence approach that enables a node to observe, learn, and make appropriate decisions on action selection in order to maximize network performance. Routing enables a source node to search for a least-cost route to its destination node. While there have been increasing efforts to enhance the traditional RL approach for routing in wireless networks, this research area remains largely unexplored in the domain of routing in CR networks. This paper applies RL in routing and investigates the effects of various features of RL (i.e., reward function, exploitation, and exploration, as well as learning rate) through simulation. New approaches and recommendations are proposed to enhance the features in order to improve the network performance brought about by RL to routing. Simulation results show that the RL parameters of the reward function, exploitation, and exploration, as well as learning rate, must be well regulated, and the new approaches proposed in this paper improve SUs’ network performance without significantly jeopardizing PUs’ network performance, specifically SUs’ interference to PUs.
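
    A per-node Q-routing sketch makes the roles of the reward function, exploration, and learning rate concrete (the names and the interference penalty are illustrative, not the paper's exact design): each node keeps a cost estimate for reaching a destination through each neighbor and updates it from the experienced hop cost plus the neighbor's own estimate.

        import random
        from collections import defaultdict

        class QRouter:
            """Q[dest][neighbor] estimates the remaining route cost; hop_cost can
            fold in delay plus a penalty for interfering with primary users."""
            def __init__(self, alpha=0.5, epsilon=0.1):
                self.q = defaultdict(lambda: defaultdict(float))
                self.alpha, self.epsilon = alpha, epsilon

            def next_hop(self, dest, neighbors):
                if random.random() < self.epsilon:           # exploration
                    return random.choice(neighbors)
                return min(neighbors, key=lambda n: self.q[dest][n])

            def update(self, dest, neighbor, hop_cost, neighbor_estimate):
                target = hop_cost + neighbor_estimate        # bootstrapped cost-to-go
                self.q[dest][neighbor] += self.alpha * (target - self.q[dest][neighbor])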

  3. Learning Agent for a Heat-Pump Thermostat with a Set-Back Strategy Using Model-Free Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Frederik Ruelens

    2015-08-01

    Full Text Available The conventional control paradigm for a heat pump with a less efficient auxiliary heating element is to keep its temperature set point constant during the day. This constant temperature set point ensures that the heat pump operates in its more efficient heat-pump mode and minimizes the risk of activating the less efficient auxiliary heating element. As an alternative to a constant set-point strategy, this paper proposes a learning agent for a thermostat with a set-back strategy. This set-back strategy relaxes the set-point temperature during convenient moments, e.g., when the occupants are not at home. Finding an optimal set-back strategy requires solving a sequential decision-making process under uncertainty, which presents two challenges. The first challenge is that for most residential buildings, a description of the thermal characteristics of the building is unavailable and challenging to obtain. The second challenge is that the relevant information on the state, i.e., the building envelope, cannot be measured by the learning agent. In order to overcome these two challenges, our paper proposes an auto-encoder coupled with a batch reinforcement learning technique. The proposed approach is validated for two building types with different thermal characteristics for heating in the winter and cooling in the summer. The simulation results indicate that the proposed learning agent can reduce the energy consumption by 4%–9% during 100 winter days and by 9%–11% during 80 summer days compared to the conventional constant set-point strategy.
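
    The batch learning step can be pictured as fitted Q-iteration over logged transitions, with the auto-encoder assumed to have already compressed the raw measurements into a compact state vector. The regressor choice and iteration count below are illustrative, not the paper's configuration.

        import numpy as np
        from sklearn.ensemble import ExtraTreesRegressor

        def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.95):
            """transitions: list of (state, action, reward, next_state), where
            'state' is the auto-encoder's low-dimensional representation."""
            s  = np.array([t[0] for t in transitions])
            a  = np.array([[t[1]] for t in transitions])
            r  = np.array([t[2] for t in transitions])
            s2 = np.array([t[3] for t in transitions])
            x = np.hstack([s, a])
            q = None
            for _ in range(n_iters):
                if q is None:
                    target = r                               # first iteration: reward only
                else:
                    next_q = np.column_stack(
                        [q.predict(np.hstack([s2, np.full((len(s2), 1), act)]))
                         for act in range(n_actions)])
                    target = r + gamma * next_q.max(axis=1)  # Bellman backup
                q = ExtraTreesRegressor(n_estimators=50).fit(x, target)
            return q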

  4. Space Objects Maneuvering Detection and Prediction via Inverse Reinforcement Learning

    Science.gov (United States)

    Linares, R.; Furfaro, R.

    This paper determines the behavior of Space Objects (SOs) using inverse Reinforcement Learning (RL) to estimate the reward function that each SO is using for control. The approach discussed in this work can be used to analyze maneuvering of SOs from observational data. The inverse RL problem is solved using the Feature Matching approach. This approach determines the optimal reward function that a SO is using while maneuvering by assuming that the observed trajectories are optimal with respect to the SO's own reward function. This paper uses estimated orbital elements data to determine the behavior of SOs in a data-driven fashion.
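
    In feature-matching inverse RL (in the spirit of apprenticeship learning; this sketch is generic, not the paper's implementation), the observed SO trajectories are summarized by their discounted feature expectations, which the recovered reward w . phi(s) must reproduce.

        import numpy as np

        def feature_expectations(trajectories, feature_fn, gamma=0.99):
            """Empirical mu = E[ sum_t gamma^t * phi(s_t) ] over observed trajectories;
            feature_fn maps an orbital state to its feature vector phi(s)."""
            mus = []
            for traj in trajectories:
                mu = sum((gamma ** t) * np.asarray(feature_fn(state))
                         for t, state in enumerate(traj))
                mus.append(mu)
            return np.mean(mus, axis=0)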

  5. Auditory temporal perceptual learning and transfer in Chinese-speaking children with developmental dyslexia.

    Science.gov (United States)

    Zhang, Manli; Xie, Weiyi; Xu, Yanzhi; Meng, Xiangzhi

    2018-03-01

    Perceptual learning refers to the improvement of perceptual performance as a function of training. Recent studies found that auditory perceptual learning may improve phonological skills in individuals with developmental dyslexia in alphabetic writing systems. However, whether auditory perceptual learning could also benefit the reading skills of those learning the Chinese logographic writing system is, as yet, unknown. The current study aimed to investigate the remediation effect of auditory temporal perceptual learning on Mandarin-speaking school children with developmental dyslexia. Thirty children with dyslexia were screened from a large pool of students in 3rd-5th grades. They completed a series of pretests and then were assigned to either a non-training control group or a training group. The training group worked on a pure tone duration discrimination task for 7 sessions over 2 weeks, with thirty minutes per session. Post-tests immediately after training and a follow-up test 2 months later were conducted. Analyses revealed a significant training effect in the training group relative to the non-training group, as well as near transfer to the temporal interval discrimination task and far transfer to phonological awareness, character recognition and reading fluency. Importantly, the training effect and all the transfer effects were stable at the 2-month follow-up session. Further analyses found that a significant correlation between character recognition performance and learning rate existed mainly in the slow learning phase, the consolidation stage of perceptual learning, and this effect was modulated by an individual's executive function. These findings indicate that adaptive auditory temporal perceptual learning can lead to learning and transfer effects on reading performance, and shed further light on the potential role of basic perceptual learning in the remediation and prevention of developmental dyslexia. Copyright © 2018 Elsevier Ltd. All rights reserved.

  6. The probability of reinforcement per trial affects posttrial responding and subsequent extinction but not within-trial responding.

    Science.gov (United States)

    Harris, Justin A; Kwok, Dorothy W S

    2018-01-01

    During magazine approach conditioning, rats do not discriminate between a conditional stimulus (CS) that is consistently reinforced with food and a CS that is occasionally (partially) reinforced, as long as the CSs have the same overall reinforcement rate per second. This implies that rats are indifferent to the probability of reinforcement per trial. However, in the same rats, the per-trial reinforcement rate will affect subsequent extinction: responding extinguishes more rapidly for a CS that was consistently reinforced than for a partially reinforced CS. Here, we trained rats with consistently and partially reinforced CSs that were matched for overall reinforcement rate per second. We measured conditioned responding both during and immediately after the CSs. Differences in the per-trial probability of reinforcement did not affect the acquisition of responding during the CS but did affect subsequent extinction of that responding, and also affected the post-CS response rates during conditioning. Indeed, CSs with the same probability of reinforcement per trial evoked the same amount of post-CS responding even when they differed in overall reinforcement rate and thus evoked different amounts of responding during the CS. We conclude that reinforcement rate per second controls rats' acquisition of responding during the CS, but at the same time, rats also learn specifically about the probability of reinforcement per trial. The latter learning affects the rats' expectation of reinforcement as an outcome of the trial, which influences their ability to detect retrospectively that an opportunity for reinforcement was missed, and, in turn, drives extinction. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  7. Continuous theta-burst stimulation (cTBS) over the lateral prefrontal cortex alters reinforcement learning bias

    NARCIS (Netherlands)

    Ott, D.V.M.; Ullsperger, M.; Jocham, G.; Neumann, J.; Klein, T.A.

    2011-01-01

    The prefrontal cortex is known to play a key role in higher-order cognitive functions. Recently, we showed that this brain region is active in reinforcement learning, during which subjects constantly have to integrate trial outcomes in order to optimize performance. To further elucidate the role of

  8. Individual Learner Differences In Web-based Learning Environments: From Cognitive, Affective and Social-cultural Perspectives

    Directory of Open Access Journals (Sweden)

    Mustafa KOC

    2005-10-01

    Full Text Available Throughout the paper, the issues of individual differences in web-based learning, also known as online instruction, online training or distance education, were examined and implications for designing distance education were discussed. Although the main purpose was to identify differences in learners' characteristics such as cognitive, affective, physiological and social factors that affect learning in a web-enhanced environment, the questions of how the web could be used to reinforce learning, what kinds of development ideas, theories and models are currently being used to design and deliver online instruction, and finally what evidence for the effectiveness of using the World Wide Web (WWW) for learning and instruction has been reported, were also analyzed to extend theoretical and epistemological understanding of web-based learning.

  9. Resting-state low-frequency fluctuations reflect individual differences in spoken language learning.

    Science.gov (United States)

    Deng, Zhizhou; Chandrasekaran, Bharath; Wang, Suiping; Wong, Patrick C M

    2016-03-01

    A major challenge in language learning studies is to identify objective, pre-training predictors of success. Variation in the low-frequency fluctuations (LFFs) of spontaneous brain activity measured by resting-state functional magnetic resonance imaging (RS-fMRI) has been found to reflect individual differences in cognitive measures. In the present study, we aimed to investigate the extent to which initial spontaneous brain activity is related to individual differences in spoken language learning. We acquired RS-fMRI data and subsequently trained participants on a sound-to-word learning paradigm in which they learned to use foreign pitch patterns (from Mandarin Chinese) to signal word meaning. We performed amplitude of spontaneous low-frequency fluctuation (ALFF) analysis, graph theory-based analysis, and independent component analysis (ICA) to identify functional components of the LFFs in the resting-state. First, we examined the ALFF as a regional measure and showed that regional ALFFs in the left superior temporal gyrus were positively correlated with learning performance, whereas ALFFs in the default mode network (DMN) regions were negatively correlated with learning performance. Furthermore, the graph theory-based analysis indicated that the degree and local efficiency of the left superior temporal gyrus were positively correlated with learning performance. Finally, the default mode network and several task-positive resting-state networks (RSNs) were identified via the ICA. The "competition" (i.e., negative correlation) between the DMN and the dorsal attention network was negatively correlated with learning performance. Our results demonstrate that a) spontaneous brain activity can predict future language learning outcome without prior hypotheses (e.g., selection of regions of interest--ROIs) and b) both regional dynamics and network-level interactions in the resting brain can account for individual differences in future spoken language learning success

  10. Resting-state low-frequency fluctuations reflect individual differences in spoken language learning

    Science.gov (United States)

    Deng, Zhizhou; Chandrasekaran, Bharath; Wang, Suiping; Wong, Patrick C.M.

    2016-01-01

    A major challenge in language learning studies is to identify objective, pre-training predictors of success. Variation in the low-frequency fluctuations (LFFs) of spontaneous brain activity measured by resting-state functional magnetic resonance imaging (RS-fMRI) has been found to reflect individual differences in cognitive measures. In the present study, we aimed to investigate the extent to which initial spontaneous brain activity is related to individual differences in spoken language learning. We acquired RS-fMRI data and subsequently trained participants on a sound-to-word learning paradigm in which they learned to use foreign pitch patterns (from Mandarin Chinese) to signal word meaning. We performed amplitude of spontaneous low-frequency fluctuation (ALFF) analysis, graph theory-based analysis, and independent component analysis (ICA) to identify functional components of the LFFs in the resting-state. First, we examined the ALFF as a regional measure and showed that regional ALFFs in the left superior temporal gyrus were positively correlated with learning performance, whereas ALFFs in the default mode network (DMN) regions were negatively correlated with learning performance. Furthermore, the graph theory-based analysis indicated that the degree and local efficiency of the left superior temporal gyrus were positively correlated with learning performance. Finally, the default mode network and several task-positive resting-state networks (RSNs) were identified via the ICA. The “competition” (i.e., negative correlation) between the DMN and the dorsal attention network was negatively correlated with learning performance. Our results demonstrate that a) spontaneous brain activity can predict future language learning outcome without prior hypotheses (e.g., selection of regions of interest – ROIs) and b) both regional dynamics and network-level interactions in the resting brain can account for individual differences in future spoken language learning success

  11. Learning temporal context shapes prestimulus alpha oscillations and improves visual discrimination performance.

    Science.gov (United States)

    Toosi, Tahereh; K Tousi, Ehsan; Esteky, Hossein

    2017-08-01

    Time is an inseparable component of every physical event that we perceive, yet it is not clear how the brain processes time or how the neuronal representation of time affects our perception of events. Here we asked subjects to perform a visual discrimination task while we changed the temporal context in which the stimuli were presented. We collected electroencephalography (EEG) signals in two temporal contexts. In predictable blocks stimuli were presented after a constant delay relative to a visual cue, and in unpredictable blocks stimuli were presented after variable delays relative to the visual cue. Four subsecond delays of 83, 150, 400, and 800 ms were used in the predictable and unpredictable blocks. We observed that predictability modulated the power of prestimulus alpha oscillations in the parieto-occipital sites: alpha power increased in the 300-ms window before stimulus onset in the predictable blocks compared with the unpredictable blocks. This modulation only occurred in the longest delay period, 800 ms, in which predictability also improved the behavioral performance of the subjects. Moreover, learning the temporal context shaped the prestimulus alpha power: modulation of prestimulus alpha power grew during the predictable block and correlated with performance enhancement. These results suggest that the brain is able to learn the subsecond temporal context of stimuli and use this to enhance sensory processing. Furthermore, the neural correlate of this temporal prediction is reflected in the alpha oscillations. NEW & NOTEWORTHY It is not well understood how the uncertainty in the timing of an external event affects its processing, particularly at subsecond scales. Here we demonstrate how a predictable timing scheme improves visual processing. We found that learning the predictable scheme gradually shaped the prestimulus alpha power. These findings indicate that the human brain is able to extract implicit subsecond patterns in the temporal context of

  12. DAT1-Genotype and Menstrual Cycle, but Not Hormonal Contraception, Modulate Reinforcement Learning: Preliminary Evidence.

    Science.gov (United States)

    Jakob, Kristina; Ehrentreich, Hanna; Holtfrerich, Sarah K C; Reimers, Luise; Diekhof, Esther K

    2018-01-01

    Hormone by genotype interactions have been widely ignored by cognitive neuroscience. Yet, the dependence of cognitive performance on both baseline dopamine (DA) and current 17ß-estradiol (E2) level argues for their combined effect also in the context of reinforcement learning. Here, we assessed how the interaction between the natural rise of E2 in the late follicular phase (FP) and the 40 base-pair variable number tandem repeat polymorphism of the dopamine transporter (DAT1) affects reinforcement learning capacity. 30 women with a regular menstrual cycle performed a probabilistic feedback learning task twice during the early and late FP. In addition, 39 women, who took hormonal contraceptives (HC) to suppress natural ovulation, were tested during the "pill break" and the intake phase of HC. The present data show that DAT1-genotype may interact with transient hormonal state, but only in women with a natural menstrual cycle. We found that carriers of the 9-repeat allele (9RP) experienced a significant decrease in the ability to avoid punishment from early to late FP. Neither homozygote subjects of the 10RP allele, nor subjects from the HC group showed a change in behavior between phases. These data are consistent with neurobiological studies that found that rising E2 may reverse DA transporter function and could enhance DA efflux, which would in turn reduce punishment sensitivity particularly in subjects with a higher transporter density to begin with. Taken together, the present results, although based on a small sample, add to the growing understanding of the complex interplay between different physiological modulators of dopaminergic transmission. They may not only point out the necessity to control for hormonal state in behavioral genetic research, but may offer new starting points for studies in clinical settings.

  13. DAT1-Genotype and Menstrual Cycle, but Not Hormonal Contraception, Modulate Reinforcement Learning: Preliminary Evidence

    Directory of Open Access Journals (Sweden)

    Kristina Jakob

    2018-02-01

    Full Text Available Hormone by genotype interactions have been widely ignored by cognitive neuroscience. Yet, the dependence of cognitive performance on both baseline dopamine (DA) and current 17ß-estradiol (E2) level argues for their combined effect also in the context of reinforcement learning. Here, we assessed how the interaction between the natural rise of E2 in the late follicular phase (FP) and the 40 base-pair variable number tandem repeat polymorphism of the dopamine transporter (DAT1) affects reinforcement learning capacity. 30 women with a regular menstrual cycle performed a probabilistic feedback learning task twice during the early and late FP. In addition, 39 women, who took hormonal contraceptives (HC) to suppress natural ovulation, were tested during the “pill break” and the intake phase of HC. The present data show that DAT1-genotype may interact with transient hormonal state, but only in women with a natural menstrual cycle. We found that carriers of the 9-repeat allele (9RP) experienced a significant decrease in the ability to avoid punishment from early to late FP. Neither homozygote subjects of the 10RP allele, nor subjects from the HC group showed a change in behavior between phases. These data are consistent with neurobiological studies that found that rising E2 may reverse DA transporter function and could enhance DA efflux, which would in turn reduce punishment sensitivity particularly in subjects with a higher transporter density to begin with. Taken together, the present results, although based on a small sample, add to the growing understanding of the complex interplay between different physiological modulators of dopaminergic transmission. They may not only point out the necessity to control for hormonal state in behavioral genetic research, but may offer new starting points for studies in clinical settings.

  14. Optimal medication dosing from suboptimal clinical examples: a deep reinforcement learning approach.

    Science.gov (United States)

    Nemati, Shamim; Ghassemi, Mohammad M; Clifford, Gari D

    2016-08-01

    Misdosing medications with sensitive therapeutic windows, such as heparin, can place patients at unnecessary risk, increase length of hospital stay, and lead to wasted hospital resources. In this work, we present a clinician-in-the-loop sequential decision making framework, which provides an individualized dosing policy adapted to each patient's evolving clinical phenotype. We employed retrospective data from the publicly available MIMIC II intensive care unit database, and developed a deep reinforcement learning algorithm that learns an optimal heparin dosing policy from sample dosing trials and their associated outcomes in large electronic medical records. Using separate training and testing datasets, our model was observed to be effective in proposing heparin doses that resulted in better expected outcomes than the clinical guidelines. Our results demonstrate that a sequential modeling approach, learned from retrospective data, could potentially be used at the bedside to derive individualized patient dosing policies.

  15. Off-policy integral reinforcement learning optimal tracking control for continuous-time chaotic systems

    International Nuclear Information System (INIS)

    Wei Qing-Lai; Song Rui-Zhuo; Xiao Wen-Dong; Sun Qiu-Ye

    2015-01-01

    This paper proposes an off-policy integral reinforcement learning (IRL) algorithm to obtain the optimal tracking control of unknown chaotic systems. Off-policy IRL can learn the solution of the Hamilton–Jacobi–Bellman (HJB) equation from system data generated by an arbitrary control. Moreover, off-policy IRL can be regarded as a direct learning method, which avoids the identification of the system dynamics. In this paper, the performance index function is first given based on the system tracking error and control error. To solve the HJB equation, an off-policy IRL algorithm is proposed. It is proven that the iterative control makes the tracking error system asymptotically stable, and that the iterative performance index function is convergent. A simulation study demonstrates the effectiveness of the developed tracking control method. (paper)
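    For orientation, a generic continuous-time optimal tracking formulation of this kind can be written as below. This is a standard textbook form given purely as a sketch; the exact performance index and system model used in the paper may differ.

```latex
% Illustrative tracking performance index and HJB condition (generic form, not the paper's exact one)
V\big(e(t)\big) = \int_{t}^{\infty} \left( e^{\top} Q\, e + u_e^{\top} R\, u_e \right) d\tau ,
\qquad
0 = \min_{u_e} \left[ e^{\top} Q\, e + u_e^{\top} R\, u_e
      + \nabla V^{\top} \left( f(e) + g(e)\, u_e \right) \right]
```

    Here e denotes the tracking error, u_e the control error, Q and R are positive-definite weighting matrices, and f and g describe the error dynamics; the integral, data-based form of this equation is what allows the solution to be learned from measured trajectories without identifying f and g.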

  16. A reinforcement learning model of joy, distress, hope and fear

    Science.gov (United States)

    Broekens, Joost; Jacobs, Elmer; Jonker, Catholijn M.

    2015-07-01

    In this paper we computationally study the relation between adaptive behaviour and emotion. Using the reinforcement learning framework, we propose that learned state utility models fear (negative) and hope (positive), based on the fact that both signals are about anticipation of loss or gain. Further, we propose that joy/distress is a signal similar to the error signal. We present agent-based simulation experiments that show that this model replicates psychological and behavioural dynamics of emotion. This work distinguishes itself by assessing the dynamics of emotion in an adaptive agent framework - coupling it to the literature on habituation, development, extinction and hope theory. Our results support the idea that the function of emotion is to provide a complex feedback signal for an organism to adapt its behaviour. Our work is relevant for understanding the relation between emotion and adaptation in animals, as well as for human-robot interaction, in particular how emotional signals can be used to communicate between adaptive agents and humans.
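    As one concrete reading of that mapping, the minimal sketch below pairs a tabular TD(0) learner with the proposed emotion read-outs: anticipated state value stands in for hope (positive) or fear (negative), and the prediction error stands in for joy (positive) or distress (negative). The class and variable names are illustrative and are not taken from the paper.

```python
# Minimal tabular TD(0) learner with emotion read-outs (illustrative sketch).
from collections import defaultdict

class EmotionalTD:
    def __init__(self, alpha=0.1, gamma=0.9):
        self.V = defaultdict(float)     # learned state utility
        self.alpha, self.gamma = alpha, gamma

    def step(self, s, r, s_next, terminal=False):
        v_next = 0.0 if terminal else self.V[s_next]
        delta = r + self.gamma * v_next - self.V[s]   # TD error
        self.V[s] += self.alpha * delta
        return {
            "hope":     max(v_next, 0.0),    # anticipated gain in the upcoming state
            "fear":     max(-v_next, 0.0),   # anticipated loss in the upcoming state
            "joy":      max(delta, 0.0),     # outcome better than expected
            "distress": max(-delta, 0.0),    # outcome worse than expected
        }
```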

  17. Depression, Activity, and Evaluation of Reinforcement

    Science.gov (United States)

    Hammen, Constance L.; Glass, David R., Jr.

    1975-01-01

    This research attempted to find the causal relation between mood and level of reinforcement. An effort was made to learn what mood change might occur if depressed subjects increased their levels of participation in reinforcing activities. (Author/RK)

  18. Pareto Optimal Solutions for Network Defense Strategy Selection Simulator in Multi-Objective Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Yang Sun

    2018-01-01

    Full Text Available Using Pareto optimization in Multi-Objective Reinforcement Learning (MORL) leads to better learning results for network defense games. This is particularly useful for network security agents, who must often balance several goals when choosing what action to take in defense of a network. If the defender knows his preferred reward distribution, the advantages of Pareto optimization can be retained by using a scalarization algorithm prior to the implementation of the MORL. In this paper, we simulate a network defense scenario by creating a multi-objective zero-sum game and using Pareto optimization and MORL to determine optimal solutions and compare those solutions to different scalarization approaches. We build a Pareto Defense Strategy Selection Simulator (PDSSS) system for assisting network administrators on decision-making, specifically, on defense strategy selection, and the experiment results show that the Satisficing Trade-Off Method (STOM) scalarization approach performs better than linear scalarization or the GUESS method. The results of this paper can aid network security agents attempting to find an optimal defense policy for network security games.
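    To illustrate what scalarizing a multi-objective reward vector involves, the sketch below contrasts a simple weighted sum with a reference-point (aspiration-based) criterion in the spirit of satisficing trade-off methods. The functions and the numeric example are hypothetical and are not the exact STOM formulation evaluated in the paper.

```python
import numpy as np

def linear_scalarize(rewards, weights):
    """Weighted sum of objective values (simple linear scalarization)."""
    return float(np.dot(weights, rewards))

def reference_point_scalarize(rewards, weights, aspiration):
    """Chebyshev-style scalarization toward an aspiration (reference) point.
    STOM-like methods use a related weighted min-max criterion; this is only
    an illustrative stand-in, not the paper's exact formulation."""
    return float(-np.max(weights * (aspiration - rewards)))   # higher is better

# Hypothetical example: two defense objectives (e.g., service availability, attacker cost)
r = np.array([0.6, 0.9])
w = np.array([0.5, 0.5])
z = np.array([1.0, 1.0])          # aspiration levels for each objective
print(linear_scalarize(r, w), reference_point_scalarize(r, w, z))
```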

  19. A Plant Control Technology Using Reinforcement Learning Method with Automatic Reward Adjustment

    Science.gov (United States)

    Eguchi, Toru; Sekiai, Takaaki; Yamada, Akihiro; Shimizu, Satoru; Fukai, Masayuki

    A control technology using Reinforcement Learning (RL) and a Radial Basis Function (RBF) Network has been developed to reduce environmental load substances exhausted from power and industrial plants. This technology consists of a statistical model using an RBF Network, which estimates the characteristics of plants with respect to environmental load substances, and an RL agent, which learns the control logic for the plants using the statistical model. In this technology, an appropriate reward function must be designed for the agent according to the operation conditions and control goals so that plants can be controlled flexibly. Therefore, we propose an automatic reward adjusting method of RL for plant control. This method adjusts the reward function automatically using information from the statistical model obtained in its learning process. In the simulations, it is confirmed that the proposed method can adjust the reward function adaptively for several test functions, and executes robust control of the thermal power plant under changes of operation conditions and control goals.

  20. Temporal Discontiguity Is neither Necessary nor Sufficient for Learning-Induced Effects on Adult Neurogenesis

    Science.gov (United States)

    Leuner, Benedetta; Waddell, Jaylyn; Gould, Elizabeth; Shors, Tracey J.

    2012-01-01

    Some, but not all, types of learning and memory can influence neurogenesis in the adult hippocampus. Trace eyeblink conditioning has been shown to enhance the survival of new neurons, whereas delay eyeblink conditioning has no such effect. The key difference between the two training procedures is that the conditioning stimuli are separated in time during trace but not delay conditioning. These findings raise the question of whether temporal discontiguity is necessary for enhancing the survival of new neurons. Here we used two approaches to test this hypothesis. First, we examined the influence of a delay conditioning task in which the duration of the conditioned stimulus (CS) was increased nearly twofold, a procedure that critically engages the hippocampus. Although the CS and unconditioned stimulus are contiguous, this very long delay conditioning procedure increased the number of new neurons that survived. Second, we examined the influence of learning the trace conditioned response (CR) after having acquired the CR during delay conditioning, a procedure that renders trace conditioning hippocampal-independent. In this case, trace conditioning did not enhance the survival of new neurons. Together, these results demonstrate that associative learning increases the survival of new neurons in the adult hippocampus, regardless of temporal contiguity. PMID:17192426

  1. Finite Element Analysis of Increasing Column Section and CFRP Reinforcement Method under Different Axial Compression Ratio

    Science.gov (United States)

    Jinghai, Zhou; Tianbei, Kang; Fengchi, Wang; Xindong, Wang

    2017-11-01

    Eight frame joints with fewer stirrups in the core area are simulated with the ABAQUS finite element software. The composite reinforcement method combines carbon fiber strengthening with an enlarged column section, and the reinforced specimens are analyzed at axial compression ratios of 0.3, 0.45 and 0.6. Analysis of the load-displacement curves, ductility and stiffness shows that the axial compression ratio has a large influence on the bearing capacity of the enlarged-column-section strengthening method and little influence on the carbon fiber reinforcement method. All of the strengthening schemes improve the ultimate bearing capacity and ductility of the frame joints to some extent: the composite reinforcement method gives the largest improvement, followed by the enlarged column section, with carbon fiber reinforcement giving the smallest increase.

  2. Brain Circuits of Methamphetamine Place Reinforcement Learning: The Role of the Hippocampus-VTA Loop.

    Science.gov (United States)

    Keleta, Yonas B; Martinez, Joe L

    2012-03-01

    The reinforcing effects of addictive drugs including methamphetamine (METH) involve the midbrain ventral tegmental area (VTA). The VTA is the primary source of dopamine (DA) to the nucleus accumbens (NAc) and the ventral hippocampus (VHC). These three brain regions are functionally connected through the hippocampal-VTA loop, which includes two main neural pathways: the bottom-up pathway and the top-down pathway. In this paper, we take the view that addiction is a learning process. Therefore, we tested the involvement of the hippocampus in reinforcement learning by studying conditioned place preference (CPP) learning, sequentially conditioning each of the three nuclei in either the bottom-up order (VTA, then VHC, finally NAc) or the top-down order (VHC, then VTA, finally NAc). Following habituation, the rats underwent experimental modules consisting of two conditioning trials each followed by immediate testing (test 1 and test 2) and two additional tests 24 h (test 3) and/or 1 week following conditioning (test 4). The module was repeated three times for each nucleus. The results showed that METH, but not Ringer's, produced positive CPP following conditioning of each brain area in the bottom-up order. In the top-down order, METH, but not Ringer's, produced either an aversive CPP or no learning effect following conditioning of each nucleus of interest. In addition, METH place aversion was antagonized by coadministration of the N-methyl-d-aspartate (NMDA) receptor antagonist MK801, suggesting that the aversion learning was an NMDA receptor activation-dependent process. We conclude that the hippocampus is a critical structure in the reward circuit and hence suggest that the development of target-specific therapeutics for the control of addiction should emphasize the hippocampus-VTA top-down connection.

  3. Reinforcement Magnitude: An Evaluation of Preference and Reinforcer Efficacy

    OpenAIRE

    Trosclair-Lasserre, Nicole M; Lerman, Dorothea C; Call, Nathan A; Addison, Laura R; Kodak, Tiffany

    2008-01-01

    Consideration of reinforcer magnitude may be important for maximizing the efficacy of treatment for problem behavior. Nonetheless, relatively little is known about children's preferences for different magnitudes of social reinforcement or the extent to which preference is related to differences in reinforcer efficacy. The purpose of the current study was to evaluate the relations among reinforcer magnitude, preference, and efficacy by drawing on the procedures and results of basic experimenta...

  4. Corticostriatal circuit mechanisms of value-based action selection: Implementation of reinforcement learning algorithms and beyond.

    Science.gov (United States)

    Morita, Kenji; Jitsev, Jenia; Morrison, Abigail

    2016-09-15

    Value-based action selection has been suggested to be realized in the corticostriatal local circuits through competition among neural populations. In this article, we review theoretical and experimental studies that have constructed and verified this notion, and provide new perspectives on how the local-circuit selection mechanisms implement reinforcement learning (RL) algorithms and computations beyond them. The striatal neurons are mostly inhibitory, and lateral inhibition among them has been classically proposed to realize "Winner-Take-All (WTA)" selection of the maximum-valued action (i.e., 'max' operation). Although this view has been challenged by the revealed weakness, sparseness, and asymmetry of lateral inhibition, which suggest more complex dynamics, WTA-like competition could still occur on short time scales. Unlike the striatal circuit, the cortical circuit contains recurrent excitation, which may enable retention or temporal integration of information and probabilistic "soft-max" selection. The striatal "max" circuit and the cortical "soft-max" circuit might co-implement an RL algorithm called Q-learning; the cortical circuit might also similarly serve for other algorithms such as SARSA. In these implementations, the cortical circuit presumably sustains activity representing the executed action, which negatively impacts dopamine neurons so that they can calculate reward-prediction-error. Regarding the suggested more complex dynamics of striatal, as well as cortical, circuits on long time scales, which could be viewed as a sequence of short WTA fragments, computational roles remain open: such a sequence might represent (1) sequential state-action-state transitions, constituting replay or simulation of the internal model, (2) a single state/action by the whole trajectory, or (3) probabilistic sampling of state/action. Copyright © 2016. Published by Elsevier B.V.
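    To make the contrast between "max" and "soft-max" selection concrete, the sketch below shows both policies next to the standard Q-learning update, which itself uses the max operation over next-state action values. The mapping of softmax to cortex and winner-take-all to striatum is the review's hypothesis; the code is only an illustration of the computations involved, with hypothetical problem sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example setup: a small tabular problem (sizes are arbitrary)
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))

def softmax_policy(q_values, beta=3.0):
    """'Soft-max' selection, associated here with the cortical circuit."""
    p = np.exp(beta * (q_values - q_values.max()))
    p /= p.sum()
    return int(rng.choice(len(q_values), p=p))

def wta_policy(q_values):
    """'Winner-take-all' (max) selection, associated here with the striatal circuit."""
    return int(np.argmax(q_values))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    # Q-learning bootstraps from the max over next actions (the 'max' operation above).
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```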

  5. Learning Object Names at Different Hierarchical Levels Using Cross-Situational Statistics.

    Science.gov (United States)

    Chen, Chi-Hsin; Zhang, Yayun; Yu, Chen

    2018-05-01

    Objects in the world usually have names at different hierarchical levels (e.g., beagle, dog, animal). This research investigates adults' ability to use cross-situational statistics to simultaneously learn object labels at individual and category levels. The results revealed that adults were able to use co-occurrence information to learn hierarchical labels in contexts where the labels for individual objects and labels for categories were presented in completely separated blocks, in interleaved blocks, or mixed in the same trial. Temporal presentation schedules significantly affected the learning of individual object labels, but not the learning of category labels. Learners' subsequent generalization of category labels indicated sensitivity to the structure of statistical input. Copyright © 2017 Cognitive Science Society, Inc.

  6. Reinforcement Magnitude: An Evaluation of Preference and Reinforcer Efficacy

    Science.gov (United States)

    Trosclair-Lasserre, Nicole M.; Lerman, Dorothea C.; Call, Nathan A.; Addison, Laura R.; Kodak, Tiffany

    2008-01-01

    Consideration of reinforcer magnitude may be important for maximizing the efficacy of treatment for problem behavior. Nonetheless, relatively little is known about children's preferences for different magnitudes of social reinforcement or the extent to which preference is related to differences in reinforcer efficacy. The purpose of the current…

  7. Reinforced flexural elements for TEMP-STRESS Program

    International Nuclear Information System (INIS)

    Marchertas, A.H.; Kennedy, J.M.; Pfeiffer, P.A.

    1987-06-01

    The implementation of reinforced flexural elements into the thermal-mechanical finite element program TEMP-STRESS is described. With explicit temporal integration and dynamic relaxation capabilities in the program, the flexural elements provide an efficient method for the treatment of reinforced structures subjected to transient and static loads. The capability of the computer program is illustrated by the solution of several examples: the simulation of a reinforced concrete beam; simulations of a reinforced concrete containment shell which is subjected to internal pressurization, thermal gradients through the walls, and transient pressure loads. The results of this analysis are relevant in the structural design/safety evaluations of typical reactor containment structures. 22 refs., 13 figs

  8. Off-Policy Reinforcement Learning for Synchronization in Multiagent Graphical Games.

    Science.gov (United States)

    Li, Jinna; Modares, Hamidreza; Chai, Tianyou; Lewis, Frank L; Xie, Lihua

    2017-10-01

    This paper develops an off-policy reinforcement learning (RL) algorithm to solve optimal synchronization of multiagent systems. This is accomplished by using the framework of graphical games. In contrast to traditional control protocols, which require complete knowledge of agent dynamics, the proposed off-policy RL algorithm is a model-free approach, in that it solves the optimal synchronization problem without knowing any knowledge of the agent dynamics. A prescribed control policy, called behavior policy, is applied to each agent to generate and collect data for learning. An off-policy Bellman equation is derived for each agent to learn the value function for the policy under evaluation, called target policy, and find an improved policy, simultaneously. Actor and critic neural networks along with least-square approach are employed to approximate target control policies and value functions using the data generated by applying prescribed behavior policies. Finally, an off-policy RL algorithm is presented that is implemented in real time and gives the approximate optimal control policy for each agent using only measured data. It is shown that the optimal distributed policies found by the proposed algorithm satisfy the global Nash equilibrium and synchronize all agents to the leader. Simulation results illustrate the effectiveness of the proposed method.

  9. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning.

    Science.gov (United States)

    Uchibe, Eiji; Doya, Kenji

    2008-12-01

    Understanding the design principle of reward functions is a substantial challenge both in artificial intelligence and neuroscience. Successful acquisition of a task usually requires not only rewards for goals, but also for intermediate states to promote effective exploration. This paper proposes a method for designing 'intrinsic' rewards of autonomous agents by combining constrained policy gradient reinforcement learning and embodied evolution. To validate the method, we use Cyber Rodent robots, in which collision avoidance, recharging from battery packs, and 'mating' by software reproduction are three major 'extrinsic' rewards. We show in hardware experiments that the robots can find appropriate 'intrinsic' rewards for the vision of battery packs and other robots to promote approach behaviors.

  10. Self-learning fuzzy controllers based on temporal back propagation

    Science.gov (United States)

    Jang, Jyh-Shing R.

    1992-01-01

    This paper presents a generalized control strategy that enhances fuzzy controllers with self-learning capability for achieving prescribed control objectives in a near-optimal manner. This methodology, termed temporal back propagation, is model-insensitive in the sense that it can deal with plants that can be represented in a piecewise-differentiable format, such as difference equations, neural networks, GMDH structures, and fuzzy models. Regardless of the numbers of inputs and outputs of the plants under consideration, the proposed approach can either refine the fuzzy if-then rules obtained from human experts, or automatically derive the fuzzy if-then rules if human experts are not available. The inverted pendulum system is employed as a test-bed to demonstrate the effectiveness of the proposed control scheme and the robustness of the acquired fuzzy controller.

  11. A sequence identification measurement model to investigate the implicit learning of metrical temporal patterns.

    Directory of Open Access Journals (Sweden)

    Benjamin G Schultz

    Full Text Available Implicit learning (IL) occurs unconsciously and without intention. Perceptual fluency is the ease of processing elicited by previous exposure to a stimulus. It has been assumed that perceptual fluency is associated with IL. However, the role of perceptual fluency following IL has not been investigated in temporal pattern learning. Two experiments by Schultz, Stevens, Keller, and Tillmann demonstrated the IL of auditory temporal patterns using a serial reaction-time task and a generation task based on the process dissociation procedure. The generation task demonstrated that learning was implicit in both experiments via motor fluency, that is, the inability to suppress learned information. With the aim to disentangle conscious and unconscious processes, we analyze unreported recognition data associated with the Schultz et al. experiments using the sequence identification measurement model. The model assumes that perceptual fluency reflects unconscious processes and IL. For Experiment 1, the model indicated that conscious and unconscious processes contributed to recognition of temporal patterns, but that unconscious processes had a greater influence on recognition than conscious processes. In the model implementation of Experiment 2, there was equal contribution of conscious and unconscious processes in the recognition of temporal patterns. As Schultz et al. demonstrated IL in both experiments using a generation task, and the conditions reported here in Experiments 1 and 2 were identical, two explanations are offered for the discrepancy in model and behavioral results based on the two tasks: (1) perceptual fluency may not be necessary to infer IL, or (2) conscious control over implicitly learned information may vary as a function of perceptual fluency and motor fluency.

  12. Effectiveness of an educational video as an instrument to refresh and reinforce the learning of a nursing technique: a randomized controlled trial.

    Science.gov (United States)

    Salina, Loris; Ruffinengo, Carlo; Garrino, Lorenza; Massariello, Patrizia; Charrier, Lorena; Martin, Barbara; Favale, Maria Santina; Dimonte, Valerio

    2012-05-01

    The Undergraduate Nursing Course has been using videos for the past year or so. Videos are used for many different purposes such as during lessons, nurse refresher courses, reinforcement, and sharing and comparison of knowledge with the professional and scientific community. The purpose of this study was to estimate the efficacy of the video (moving an uncooperative patient from the supine to the lateral position) as an instrument to refresh and reinforce nursing techniques. A two-arm randomized controlled trial (RCT) design was chosen: both groups attended lessons in the classroom as well as in the laboratory; a month later, while one group received written information as a refresher, the other group watched the video. Both groups were evaluated in a blinded fashion. A total of 223 students agreed to take part in the study. The difference observed between those who had seen the video and those who had read up on the technique turned out to be an average of 6.19 points in favour of the first. Students who had seen the video were better able to apply the technique, resulting in a better performance. The video, therefore, represents an important tool to refresh and reinforce previous learning.

  13. Learning rational temporal eye movement strategies.

    Science.gov (United States)

    Hoppe, David; Rothkopf, Constantin A

    2016-07-19

    During active behavior humans redirect their gaze several times every second within the visual environment. Where we look within static images is highly efficient, as quantified by computational models of human gaze shifts in visual search and face recognition tasks. However, when we shift gaze is mostly unknown despite its fundamental importance for survival in a dynamic world. It has been suggested that during naturalistic visuomotor behavior gaze deployment is coordinated with task-relevant events, often predictive of future events, and studies in sportsmen suggest that timing of eye movements is learned. Here we establish that humans efficiently learn to adjust the timing of eye movements in response to environmental regularities when monitoring locations in the visual scene to detect probabilistically occurring events. To detect the events humans adopt strategies that can be understood through a computational model that includes perceptual and acting uncertainties, a minimal processing time, and, crucially, the intrinsic costs of gaze behavior. Thus, subjects traded off event detection rate with behavioral costs of carrying out eye movements. Remarkably, based on this rational bounded actor model the time course of learning the gaze strategies is fully explained by an optimal Bayesian learner with humans' characteristic uncertainty in time estimation, the well-known scalar law of biological timing. Taken together, these findings establish that the human visual system is highly efficient in learning temporal regularities in the environment and that it can use these regularities to control the timing of eye movements to detect behaviorally relevant events.

  14. Improving Accuracy and Temporal Resolution of Learning Curve Estimation for within- and across-Session Analysis

    Science.gov (United States)

    Tabelow, Karsten; König, Reinhard; Polzehl, Jörg

    2016-01-01

    Estimation of learning curves is ubiquitously based on proportions of correct responses within moving trial windows. Thereby, it is tacitly assumed that learning performance is constant within the moving windows, which, however, is often not the case. In the present study we demonstrate that violations of this assumption lead to systematic errors in the analysis of learning curves, and we explored the dependency of these errors on window size, different statistical models, and learning phase. To reduce these errors in the analysis of single-subject data as well as on the population level, we propose adequate statistical methods for the estimation of learning curves and the construction of confidence intervals, trial by trial. Applied to data from an avoidance learning experiment with rodents, these methods revealed performance changes occurring at multiple time scales within and across training sessions which were otherwise obscured in the conventional analysis. Our work shows that the proper assessment of the behavioral dynamics of learning at high temporal resolution can shed new light on specific learning processes, and, thus, allows to refine existing learning concepts. It further disambiguates the interpretation of neurophysiological signal changes recorded during training in relation to learning. PMID:27303809
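    The conventional estimator criticized here is simply the proportion of correct responses inside a sliding trial window, as in the minimal sketch below. The simulated data and window size are illustrative; the improved trial-by-trial methods proposed in the paper are not reproduced here.

```python
import numpy as np

def moving_window_learning_curve(correct, window=20):
    """Conventional estimate: proportion correct in a moving trial window,
    which implicitly assumes constant performance inside each window."""
    correct = np.asarray(correct, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(correct, kernel, mode="valid")

# Illustrative use with simulated binary responses (1 = correct response)
rng = np.random.default_rng(1)
p_true = np.linspace(0.5, 0.9, 200)            # performance actually drifts upward
responses = rng.binomial(1, p_true)
curve = moving_window_learning_curve(responses, window=20)
```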

  15. Improving Accuracy and Temporal Resolution of Learning Curve Estimation for within- and across-Session Analysis.

    Directory of Open Access Journals (Sweden)

    Matthias Deliano

    Full Text Available Estimation of learning curves is ubiquitously based on proportions of correct responses within moving trial windows. Thereby, it is tacitly assumed that learning performance is constant within the moving windows, which, however, is often not the case. In the present study we demonstrate that violations of this assumption lead to systematic errors in the analysis of learning curves, and we explored the dependency of these errors on window size, different statistical models, and learning phase. To reduce these errors in the analysis of single-subject data as well as on the population level, we propose adequate statistical methods for the estimation of learning curves and the construction of confidence intervals, trial by trial. Applied to data from an avoidance learning experiment with rodents, these methods revealed performance changes occurring at multiple time scales within and across training sessions which were otherwise obscured in the conventional analysis. Our work shows that the proper assessment of the behavioral dynamics of learning at high temporal resolution can shed new light on specific learning processes, and, thus, allows to refine existing learning concepts. It further disambiguates the interpretation of neurophysiological signal changes recorded during training in relation to learning.

  16. Dopaminergic control of motivation and reinforcement learning: a closed-circuit account for reward-oriented behavior.

    Science.gov (United States)

    Morita, Kenji; Morishima, Mieko; Sakai, Katsuyuki; Kawaguchi, Yasuo

    2013-05-15

    Humans and animals take actions quickly when they expect that the actions lead to reward, reflecting their motivation. Injection of dopamine receptor antagonists into the striatum has been shown to slow such reward-seeking behavior, suggesting that dopamine is involved in the control of motivational processes. Meanwhile, neurophysiological studies have revealed that phasic response of dopamine neurons appears to represent reward prediction error, indicating that dopamine plays central roles in reinforcement learning. However, previous attempts to elucidate the mechanisms of these dopaminergic controls have not fully explained how the motivational and learning aspects are related and whether they can be understood by the way the activity of dopamine neurons itself is controlled by their upstream circuitries. To address this issue, we constructed a closed-circuit model of the corticobasal ganglia system based on recent findings regarding intracortical and corticostriatal circuit architectures. Simulations show that the model could reproduce the observed distinct motivational effects of D1- and D2-type dopamine receptor antagonists. Simultaneously, our model successfully explains the dopaminergic representation of reward prediction error as observed in behaving animals during learning tasks and could also explain distinct choice biases induced by optogenetic stimulation of the D1 and D2 receptor-expressing striatal neurons. These results indicate that the suggested roles of dopamine in motivational control and reinforcement learning can be understood in a unified manner through a notion that the indirect pathway of the basal ganglia represents the value of states/actions at a previous time point, an empirically driven key assumption of our model.

  17. The critical dimensions of the response-reinforcer contingency.

    Science.gov (United States)

    Williams, B A.

    2001-05-03

    Two major dimensions of any contingency of reinforcement are the temporal relation between a response and its reinforcer, and the relative frequency of the reinforcer given the response versus when the response has not occurred. Previous data demonstrate that time, per se, is not sufficient to explain the effects of delay-of-reinforcement procedures; needed in addition is some account of the events occurring in the delay interval. Moreover, the effects of the same absolute time values vary greatly across situations, such that any notion of a standard delay-of-reinforcement gradient is simplistic. The effects of reinforcers occurring in the absence of a response depend critically upon the stimulus conditions paired with those reinforcers, in much the same manner as has been shown with Pavlovian contingency effects. However, it is unclear whether the underlying basis of such effects is response competition or changes in the calculus of causation.

  18. Dynamic Resource Allocation with Integrated Reinforcement Learning for a D2D-Enabled LTE-A Network with Access to Unlicensed Band

    Directory of Open Access Journals (Sweden)

    Alia Asheralieva

    2016-01-01

    Full Text Available We propose a dynamic resource allocation algorithm for device-to-device (D2D) communication underlying a Long Term Evolution Advanced (LTE-A) network with reinforcement learning (RL) applied for unlicensed channel allocation. In the considered system, the inband and outband resources are assigned by the LTE evolved NodeB (eNB) to different device pairs to maximize the network utility subject to the target signal-to-interference-and-noise ratio (SINR) constraints. Because of the absence of an established control link between the unlicensed and cellular radio interfaces, the eNB cannot acquire any information about the quality and availability of unlicensed channels. As a result, the considered problem becomes a stochastic optimization problem that can be dealt with by deploying learning theory (to estimate the random unlicensed channel environment). Consequently, we formulate the outband D2D access as a dynamic single-player game in which the player (eNB) estimates its possible strategy and expected utility for all of its actions based only on its own local observations using a joint utility and strategy estimation based reinforcement learning (JUSTE-RL) with regret algorithm. The proposed approach for resource allocation demonstrates near-optimal performance after a small number of RL iterations and surpasses the other comparable methods in terms of energy efficiency and throughput maximization.

  19. Multi-temporal Land Use Mapping of Coastal Wetlands Area using Machine Learning in Google Earth Engine

    Science.gov (United States)

    Farda, N. M.

    2017-12-01

    Coastal wetlands provide ecosystem services essential to people and the environment. Changes in coastal wetlands, especially in land use, are important to monitor by utilizing multi-temporal imagery. Google Earth Engine (GEE) provides many machine learning algorithms (10 algorithms) that are very useful for extracting land use from imagery. The research objective is to explore machine learning in Google Earth Engine and its accuracy for multi-temporal land use mapping of a coastal wetland area. Landsat 3 MSS (1978), Landsat 5 TM (1991), Landsat 7 ETM+ (2001), and Landsat 8 OLI (2014) images located in the Segara Anakan lagoon were selected to represent multi-temporal images. The inputs for machine learning are the visible and near infrared bands, PCA bands, inverse PCA bands, bare soil index, vegetation index, wetness index, elevation from ASTER GDEM, and GLCM (Haralick) texture, together with polygon samples in 140 locations. Ten machine learning algorithms were applied to extract coastal wetland land use from the Landsat imagery: Fast Naive Bayes, CART (Classification and Regression Tree), Random Forests, GMO Max Entropy, Perceptron (Multi Class Perceptron), Winnow, Voting SVM, Margin SVM, Pegasos (Primal Estimated sub-GrAdient SOlver for SVM), and IKPamir (Intersection Kernel Passive Aggressive Method for Information Retrieval, SVM). Machine learning in Google Earth Engine is very helpful for multi-temporal land use mapping; the highest accuracy for land use mapping of the coastal wetland is obtained with CART at 96.98 % Overall Accuracy using K-Fold Cross Validation (K = 10). GEE is particularly useful for multi-temporal land use mapping with ready-to-use images and classification algorithms, and is also very challenging for other applications.
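    As a rough illustration of the CART plus 10-fold cross-validation workflow on tabular band and index features, the sketch below uses scikit-learn on synthetic data. The study itself ran inside Google Earth Engine, whose classifier API differs, so this is only an offline analogue with hypothetical features and labels.

```python
# Offline analogue of the CART + 10-fold cross-validation step (not the GEE API).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(140, 12))       # e.g., bands, indices, texture per sample site (synthetic)
y = rng.integers(0, 5, size=140)     # hypothetical land-use class labels

cart = DecisionTreeClassifier(max_depth=8, random_state=0)
scores = cross_val_score(cart, X, y, cv=10)   # K = 10, as in the study
print("mean overall accuracy:", scores.mean())
```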

  20. Operant licking for intragastric sugar infusions: differential reinforcing actions of glucose, sucrose and fructose in mice

    Science.gov (United States)

    Sclafani, Anthony; Ackroff, Karen

    2015-01-01

    Intragastric (IG) flavor conditioning studies in rodents indicate that isocaloric sugar infusions differ in their reinforcing actions, with glucose and sucrose more potent than fructose. Here we determined if the sugars also differ in their ability to maintain operant self-administration by licking an empty spout for IG infusions. Food-restricted C57BL/6J mice were trained 1 h/day to lick a food-baited spout, which triggered IG infusions of 16% sucrose. In testing, the mice licked an empty spout, which triggered IG infusions of different sugars. Mice shifted from sucrose to 16% glucose increased dry licking, whereas mice shifted to 16% fructose rapidly reduced licking to low levels. Other mice shifted from sucrose to IG water reduced licking more slowly but reached the same low levels. Thus IG fructose, like water, is not reinforcing to hungry mice. The more rapid decline in licking induced by fructose may be due to the sugar's satiating effects. Further tests revealed that the Glucose mice increased their dry licking when shifted from 16% to 8% glucose, and reduced their dry licking when shifted to 32% glucose. This may reflect caloric regulation and/or differences in satiation. The Glucose mice did not maintain caloric intake when tested with different sugars. They self-infused less sugar when shifted from 16% glucose to 16% sucrose, and even more so when shifted to 16% fructose. Reduced sucrose self-administration may occur because the fructose component of the disaccharide reduces its reinforcing potency. FVB mice also reduced operant licking when tested with 16% fructose, yet learned to prefer a flavor paired with IG fructose. These data indicate that sugars differ substantially in their ability to support IG self-administration and flavor preference learning. The same post-oral reinforcement process appears to mediate operant licking and flavor learning, although flavor learning provides a more sensitive measure of sugar reinforcement. PMID:26485294

  1. Medial prefrontal cortex and the adaptive regulation of reinforcement learning parameters.

    Science.gov (United States)

    Khamassi, Mehdi; Enel, Pierre; Dominey, Peter Ford; Procyk, Emmanuel

    2013-01-01

    Converging evidence suggests that the medial prefrontal cortex (MPFC) is involved in feedback categorization, performance monitoring, and task monitoring, and may contribute to the online regulation of reinforcement learning (RL) parameters that would affect decision-making processes in the lateral prefrontal cortex (LPFC). Previous neurophysiological experiments have shown MPFC activities encoding error likelihood, uncertainty, reward volatility, as well as neural responses categorizing different types of feedback, for instance, distinguishing between choice errors and execution errors. Rushworth and colleagues have proposed that the involvement of the MPFC in tracking the volatility of the task could contribute to the regulation of one of the RL parameters, called the learning rate. We extend this hypothesis by proposing that the MPFC could contribute to the regulation of other RL parameters such as the exploration rate and default action values in case of task shifts. Here, we analyze the sensitivity to RL parameters of behavioral performance in two monkey decision-making tasks, one with a deterministic reward schedule and the other with a stochastic one. We show that there exist optimal parameter values specific to each of these tasks, that need to be found for optimal performance and that are usually hand-tuned in computational models. In contrast, automatic online regulation of these parameters using some heuristics can help produce a good, although non-optimal, behavioral performance in each task. We finally describe our computational model of MPFC-LPFC interaction used for online regulation of the exploration rate and its application to a human-robot interaction scenario. There, unexpected uncertainties are produced by the human introducing cued task changes or by cheating. The model enables the robot to autonomously learn to reset exploration in response to such uncertain cues and events. The combined results provide concrete evidence specifying how prefrontal

  2. Adaptive social learning strategies in temporally and spatially varying environments : how temporal vs. spatial variation, number of cultural traits, and costs of learning influence the evolution of conformist-biased transmission, payoff-biased transmission, and individual learning.

    Science.gov (United States)

    Nakahashi, Wataru; Wakano, Joe Yuichiro; Henrich, Joseph

    2012-12-01

    Long before the origins of agriculture human ancestors had expanded across the globe into an immense variety of environments, from Australian deserts to Siberian tundra. Survival in these environments did not principally depend on genetic adaptations, but instead on evolved learning strategies that permitted the assembly of locally adaptive behavioral repertoires. To develop hypotheses about these learning strategies, we have modeled the evolution of learning strategies to assess what conditions and constraints favor which kinds of strategies. To build on prior work, we focus on clarifying how spatial variability, temporal variability, and the number of cultural traits influence the evolution of four types of strategies: (1) individual learning, (2) unbiased social learning, (3) payoff-biased social learning, and (4) conformist transmission. Using a combination of analytic and simulation methods, we show that spatial-but not temporal-variation strongly favors the emergence of conformist transmission. This effect intensifies when migration rates are relatively high and individual learning is costly. We also show that increasing the number of cultural traits above two favors the evolution of conformist transmission, which suggests that the assumption of only two traits in many models has been conservative. We close by discussing how (1) spatial variability represents only one way of introducing the low-level, nonadaptive phenotypic trait variation that so favors conformist transmission, the other obvious way being learning errors, and (2) our findings apply to the evolution of conformist transmission in social interactions. Throughout we emphasize how our models generate empirical predictions suitable for laboratory testing.

  3. Effect of Different Fillers on Adhesive Wear Properties of Glass Fiber Reinforced Polyester Composites

    Directory of Open Access Journals (Sweden)

    E. Feyzullahoğlu

    2017-12-01

    Full Text Available Polymeric composites are used for different aims as substitutes for traditional materials such as metals, due to their improved strength at small specific weight. A fiber reinforced polymer (FRP) composite material consists of a polymeric matrix and a reinforcing material. Polymeric materials are commonly reinforced with synthetic fibers such as glass and carbon. Glass fiber reinforced polyester (GFRP) composites are used with different filler materials. The aim of this study is to investigate the effects of different filler materials on the adhesive wear behavior of GFRP. In this experimental study, polymethyl methacrylate (PMMA), glass beads (GB) and glass sand (GS) were used as filling materials in GFRP composite samples. The adhesive wear tests of the samples were carried out using a ball-on-disc type tribometer. The friction force and coefficient of friction were measured during the test. The volume loss and wear rate values of the samples were calculated from the test results. Barcol hardness values and densities of the samples were also measured. Results show that the wear resistance of the GB-filled GFRP composite samples was much higher than that of the non-filled and PMMA-filled GFRP composite samples.

  4. Auditory temporal-order thresholds show no gender differences

    NARCIS (Netherlands)

    van Kesteren, Marlieke T. R.; Wierslnca-Post, J. Esther C.

    2007-01-01

    Purpose: Several studies on auditory temporal-order processing showed gender differences. Women needed longer inter-stimulus intervals than men when indicating the temporal order of two clicks presented to the left and right ear. In this study, we examined whether we could reproduce these results in

  5. Auditory temporal-order thresholds show no gender differences

    NARCIS (Netherlands)

    van Kesteren, Marlieke T R; Wiersinga-Post, J Esther C

    2007-01-01

    PURPOSE: Several studies on auditory temporal-order processing showed gender differences. Women needed longer inter-stimulus intervals than men when indicating the temporal order of two clicks presented to the left and right ear. In this study, we examined whether we could reproduce these results in

  6. Reinforcement learning for a biped robot based on a CPG-actor-critic method.

    Science.gov (United States)

    Nakamura, Yutaka; Mori, Takeshi; Sato, Masa-aki; Ishii, Shin

    2007-08-01

    Animals' rhythmic movements, such as locomotion, are considered to be controlled by neural circuits called central pattern generators (CPGs), which generate oscillatory signals. Motivated by this biological mechanism, studies have been conducted on the rhythmic movements controlled by CPG. As an autonomous learning framework for a CPG controller, we propose in this article a reinforcement learning method we call the "CPG-actor-critic" method. This method introduces a new architecture to the actor, and its training is roughly based on a stochastic policy gradient algorithm presented recently. We apply this method to an automatic acquisition problem of control for a biped robot. Computer simulations show that training of the CPG can be successfully performed by our method, thus allowing the biped robot to not only walk stably but also adapt to environmental changes.

  7. Phasic dopamine as a prediction error of intrinsic and extrinsic reinforcements driving both action acquisition and reward maximization: a simulated robotic study.

    Science.gov (United States)

    Mirolli, Marco; Santucci, Vieri G; Baldassarre, Gianluca

    2013-03-01

    An important issue of recent neuroscientific research is to understand the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine as a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: according to our view phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to get rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions. Copyright © 2013 Elsevier Ltd. All rights reserved.
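    The hypothesized learning signal can be summarized as a TD-style prediction error whose reward term sums a permanent extrinsic reward and a temporary intrinsic reinforcement that decays as an event becomes predictable. The sketch below is one illustrative reading of that idea, not the authors' simulation code; all names are hypothetical.

```python
def td_error(v_s, v_next, r_extrinsic, r_intrinsic, gamma=0.95):
    """Reinforcement prediction error driven by both permanent extrinsic rewards
    and temporary intrinsic reinforcements (e.g., unexpected sensory changes)."""
    return (r_extrinsic + r_intrinsic) + gamma * v_next - v_s

class IntrinsicReward:
    """Temporary reinforcement: strong for novel or unexpected events, and
    decaying as those events become familiar and predictable."""
    def __init__(self, decay=0.9):
        self.salience = {}
        self.decay = decay

    def __call__(self, event):
        s = self.salience.get(event, 1.0)
        self.salience[event] = s * self.decay
        return s

# Illustrative usage with arbitrary values
intrinsic = IntrinsicReward()
delta = td_error(v_s=0.2, v_next=0.5, r_extrinsic=0.0, r_intrinsic=intrinsic("light_on"))
```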

  8. Real time eye tracking using Kalman extended spatio-temporal context learning

    Science.gov (United States)

    Munir, Farzeen; Minhas, Fayyaz ul Amir Asfar; Jalil, Abdul; Jeon, Moongu

    2017-06-01

    Real time eye tracking has numerous applications in human computer interaction such as a mouse cursor control in a computer system. It is useful for persons with muscular or motion impairments. However, tracking the movement of the eye is complicated by occlusion due to blinking, head movement, screen glare, rapid eye movements, etc. In this work, we present the algorithmic and construction details of a real time eye tracking system. Our proposed system is an extension of Spatio-Temporal context learning through Kalman Filtering. Spatio-Temporal Context Learning offers state of the art accuracy in general object tracking but its performance suffers due to object occlusion. Addition of the Kalman filter allows the proposed method to model the dynamics of the motion of the eye and provide robust eye tracking in cases of occlusion. We demonstrate the effectiveness of this tracking technique by controlling the computer cursor in real time by eye movements.
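    The general idea of layering a constant-velocity Kalman filter on top of a frame-by-frame tracker, so that eye position can be predicted through blinks and other occlusions, is sketched below. The state model and noise parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 2-D constant-velocity Kalman filter of the kind that could be
    layered on a spatio-temporal context tracker to bridge occlusions such as
    blinks (illustrative sketch only)."""
    def __init__(self, dt=1.0, q=1e-2, r=1.0):
        self.x = np.zeros(4)                       # state: [px, py, vx, vy]
        self.P = np.eye(4)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], float)   # constant-velocity motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)   # we only observe position
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                          # predicted eye position

    def update(self, z):                           # z: position reported by the tracker
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```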

  9. Visual reinforcement shapes eye movements in visual search.

    Science.gov (United States)

    Paeye, Céline; Schütz, Alexander C; Gegenfurtner, Karl R

    2016-08-01

    We use eye movements to gain information about our visual environment; this information can indirectly be used to affect the environment. Whereas eye movements are affected by explicit rewards such as points or money, it is not clear whether the information gained by finding a hidden target has a similar reward value. Here we tested whether finding a visual target can reinforce eye movements in visual search performed in a noise background, which conforms to natural scene statistics and contains a large number of possible target locations. First we tested whether presenting the target more often in one specific quadrant would modify eye movement search behavior. Surprisingly, participants did not learn to search for the target more often in high probability areas. Presumably, participants could not learn the reward structure of the environment. In two subsequent experiments we used a gaze-contingent display to gain full control over the reinforcement schedule. The target was presented more often after saccades into a specific quadrant or a specific direction. The proportions of saccades meeting the reinforcement criteria increased considerably, and participants matched their search behavior to the relative reinforcement rates of targets. Reinforcement learning seems to serve as the mechanism to optimize search behavior with respect to the statistics of the task.

  10. Reinforcement learning solution for HJB equation arising in constrained optimal control problem.

    Science.gov (United States)

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen; Liu, Derong

    2015-11-01

    The constrained optimal control problem depends on the solution of the complicated Hamilton-Jacobi-Bellman equation (HJBE). In this paper, a data-based off-policy reinforcement learning (RL) method is proposed, which learns the solution of the HJBE and the optimal control policy from real system data. One important feature of the off-policy RL is that its policy evaluation can be realized with data generated by other behavior policies, not necessarily the target policy, which solves the insufficient exploration problem. The convergence of the off-policy RL is proved by demonstrating its equivalence to the successive approximation approach. Its implementation procedure is based on the actor-critic neural networks structure, where the function approximation is conducted with linearly independent basis functions. Subsequently, the convergence of the implementation procedure with function approximation is also proved. Finally, its effectiveness is verified through computer simulations. Copyright © 2015 Elsevier Ltd. All rights reserved.
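
    A hedged sketch of policy evaluation with linearly independent basis functions from previously collected (off-policy) data, written here in a least-squares TD form. The scalar state, polynomial basis, and sample format are illustrative assumptions, and this is not the authors' exact algorithm.

```python
# Illustrative least-squares TD policy evaluation with a small polynomial basis:
# solve A w = b from (state, reward, next_state) samples gathered by a behavior policy.
import numpy as np

def basis(x):
    """Linearly independent basis functions over a scalar state (illustrative)."""
    return np.array([1.0, x, x**2, x**3])

def evaluate_policy(transitions, gamma=0.98):
    """transitions: iterable of (state, reward, next_state) tuples."""
    k = len(basis(0.0))
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, r, s_next in transitions:
        phi, phi_next = basis(s), basis(s_next)
        A += np.outer(phi, phi - gamma * phi_next)
        b += phi * r
    return np.linalg.solve(A, b)   # value estimate: V(s) ~ basis(s) @ w
```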

  11. Suboptimal choice, reward-predictive signals, and temporal information.

    Science.gov (United States)

    Cunningham, Paul J; Shahan, Timothy A

    2018-01-01

    Suboptimal choice refers to preference for an alternative offering a low probability of food (suboptimal alternative) over an alternative offering a higher probability of food (optimal alternative). Numerous studies have found that stimuli signaling probabilistic food play a critical role in the development and maintenance of suboptimal choice. However, there is still much debate about how to characterize how these stimuli influence suboptimal choice. There is substantial evidence that the temporal information conveyed by a food-predictive signal governs its function as both a Pavlovian conditioned stimulus and as an instrumental conditioned reinforcer. Thus, we explore the possibility that food-predictive signals influence suboptimal choice via the temporal information they convey. Application of this temporal information-theoretic approach to suboptimal choice provides a formal, quantitative framework that describes how food-predictive signals influence suboptimal choice in a manner consistent with related phenomena in Pavlovian conditioning and conditioned reinforcement. Our reanalysis of previous data on suboptimal choice suggests that, generally speaking, preference in the suboptimal choice procedure tracks relative temporal information conveyed by food-predictive signals for the suboptimal and optimal alternatives. The model suggests that suboptimal choice develops when the food-predictive signal for the suboptimal alternative conveys more temporal information than that for the optimal alternative. Finally, incorporating a role for competition between temporal information provided by food-predictive signals and relative primary reinforcement rate provides a reasonable account of existing data on suboptimal choice. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  12. Vicarious Reinforcement In Rhesus Macaques (Macaca mulatta)

    Directory of Open Access Journals (Sweden)

    Steve W. C. Chang

    2011-03-01

    Full Text Available What happens to others profoundly influences our own behavior. Such other-regarding outcomes can drive observational learning, as well as motivate cooperation, charity, empathy, and even spite. Vicarious reinforcement may serve as one of the critical mechanisms mediating the influence of other-regarding outcomes on behavior and decision-making in groups. Here we show that rhesus macaques spontaneously derive vicarious reinforcement from observing rewards given to another monkey, and that this reinforcement can motivate them to subsequently deliver or withhold rewards from the other animal. We exploited Pavlovian and instrumental conditioning to associate rewards to self (M1) and/or rewards to another monkey (M2) with visual cues. M1s made more errors in the instrumental trials when cues predicted reward to M2 compared to when cues predicted reward to M1, but made even more errors when cues predicted reward to no one. In subsequent preference tests between pairs of conditioned cues, M1s preferred cues paired with reward to M2 over cues paired with reward to no one. By contrast, M1s preferred cues paired with reward to self over cues paired with reward to both monkeys simultaneously. Rates of attention to M2 strongly predicted the strength and valence of vicarious reinforcement. These patterns of behavior, which were absent in nonsocial control trials, are consistent with vicarious reinforcement based upon sensitivity to observed, or counterfactual, outcomes with respect to another individual. Vicarious reward may play a critical role in shaping cooperation and competition, as well as motivating observational learning and group coordination in rhesus macaques, much as it does in humans. We propose that vicarious reinforcement signals mediate these behaviors via homologous neural circuits involved in reinforcement learning and decision-making.

  13. Vicarious reinforcement in rhesus macaques (macaca mulatta).

    Science.gov (United States)

    Chang, Steve W C; Winecoff, Amy A; Platt, Michael L

    2011-01-01

    What happens to others profoundly influences our own behavior. Such other-regarding outcomes can drive observational learning, as well as motivate cooperation, charity, empathy, and even spite. Vicarious reinforcement may serve as one of the critical mechanisms mediating the influence of other-regarding outcomes on behavior and decision-making in groups. Here we show that rhesus macaques spontaneously derive vicarious reinforcement from observing rewards given to another monkey, and that this reinforcement can motivate them to subsequently deliver or withhold rewards from the other animal. We exploited Pavlovian and instrumental conditioning to associate rewards to self (M1) and/or rewards to another monkey (M2) with visual cues. M1s made more errors in the instrumental trials when cues predicted reward to M2 compared to when cues predicted reward to M1, but made even more errors when cues predicted reward to no one. In subsequent preference tests between pairs of conditioned cues, M1s preferred cues paired with reward to M2 over cues paired with reward to no one. By contrast, M1s preferred cues paired with reward to self over cues paired with reward to both monkeys simultaneously. Rates of attention to M2 strongly predicted the strength and valence of vicarious reinforcement. These patterns of behavior, which were absent in non-social control trials, are consistent with vicarious reinforcement based upon sensitivity to observed, or counterfactual, outcomes with respect to another individual. Vicarious reward may play a critical role in shaping cooperation and competition, as well as motivating observational learning and group coordination in rhesus macaques, much as it does in humans. We propose that vicarious reinforcement signals mediate these behaviors via homologous neural circuits involved in reinforcement learning and decision-making.

  14. Learning Theory and the Typewriter Teacher

    Science.gov (United States)

    Wakin, B. Bertha

    1974-01-01

    Eight basic principles of learning are described and discussed in terms of practical learning strategies for typewriting. Described are goal setting, preassessment, active participation, individual differences, reinforcement, practice, transfer of learning, and evaluation. (SC)

  15. Visual Statistical Learning Works after Binding the Temporal Sequences of Shapes and Spatial Positions

    Directory of Open Access Journals (Sweden)

    Osamu Watanabe

    2011-05-01

    Full Text Available The human visual system can acquire the statistical structures in temporal sequences of object feature changes, such as changes in shape, color, and their combination. Here we investigate whether the statistical learning for spatial position and shape changes operates separately or not. It is known that the visual system processes these two types of information separately; the spatial information is processed in the parietal cortex, whereas object shapes and colors are detected in the temporal pathway, and, after that, we perceive bound information in the two streams. We examined whether the statistical learning operates before or after binding the shape and the spatial information by using the "re-paired triplet" paradigm proposed by Turk-Browne, Isola, Scholl, and Treat (2008). The result showed that observers acquired combined sequences of shape and position changes, but no statistical information in individual sequences was obtained. This finding suggests that the visual statistical learning works after binding the temporal sequences of shapes and spatial structures and would operate in the higher-order visual system; this is consistent with recent ERP (Abla & Okanoya, 2009) and fMRI (Turk-Browne, Scholl, Chun, & Johnson, 2009) studies.

  16. How partial reinforcement of food cues affects the extinction and reacquisition of appetitive responses. A new model for dieting success?

    Science.gov (United States)

    van den Akker, Karolien; Havermans, Remco C; Bouton, Mark E; Jansen, Anita

    2014-10-01

    Animals and humans can easily learn to associate an initially neutral cue with food intake through classical conditioning, but extinction of learned appetitive responses can be more difficult. Intermittent or partial reinforcement of food cues causes especially persistent behaviour in animals: after exposure to such learning schedules, the decline in responding that occurs during extinction is slow. After extinction, increases in responding with renewed reinforcement of food cues (reacquisition) might be less rapid after acquisition with partial reinforcement. In humans, it may be that the eating behaviour of some individuals resembles partial reinforcement schedules to a greater extent, possibly affecting dieting success by interacting with extinction and reacquisition. Furthermore, impulsivity has been associated with less successful dieting, and this association might be explained by impulsivity affecting the learning and extinction of appetitive responses. In the present two studies, the effects of different reinforcement schedules and impulsivity on the acquisition, extinction, and reacquisition of appetitive responses were investigated in a conditioning paradigm involving food rewards in healthy humans. Overall, the results indicate both partial reinforcement schedules and, possibly, impulsivity to be associated with worse extinction performance. A new model of dieting success is proposed: learning histories and, perhaps, certain personality traits (impulsivity) can interfere with the extinction and reacquisition of appetitive responses to food cues and they may be causally related to unsuccessful dieting. Copyright © 2014 Elsevier Ltd. All rights reserved.

  17. Spike-Based Bayesian-Hebbian Learning of Temporal Sequences.

    Directory of Open Access Journals (Sweden)

    Philip J Tully

    2016-05-01

    Full Text Available Many cognitive and motor functions are enabled by the temporal representation and processing of stimuli, but it remains an open issue how neocortical microcircuits can reliably encode and replay such sequences of information. To better understand this, a modular attractor memory network is proposed in which meta-stable sequential attractor transitions are learned through changes to synaptic weights and intrinsic excitabilities via the spike-based Bayesian Confidence Propagation Neural Network (BCPNN) learning rule. We find that the formation of distributed memories, embodied by increased periods of firing in pools of excitatory neurons, together with asymmetrical associations between these distinct network states, can be acquired through plasticity. The model's feasibility is demonstrated using simulations of adaptive exponential integrate-and-fire model neurons (AdEx). We show that the learning and speed of sequence replay depend on a confluence of biophysically relevant parameters including stimulus duration, level of background noise, ratio of synaptic currents, and strengths of short-term depression and adaptation. Moreover, sequence elements are shown to flexibly participate multiple times in the sequence, suggesting that spiking attractor networks of this type can support an efficient combinatorial code. The model provides a principled approach towards understanding how multiple interacting plasticity mechanisms can coordinate hetero-associative learning in unison.

  18. Temporal Dynamics of Task Switching and Abstract-Concept Learning in Pigeons

    Directory of Open Access Journals (Sweden)

    Thomas Alexander Daniel

    2015-09-01

    Full Text Available The current study examined whether pigeons could learn to use abstract concepts as the basis for conditionally switching behavior as a function of time. Using a mid-session reversal task, experienced pigeons were trained to switch from matching-to-sample (MTS) to non-matching-to-sample (NMTS) conditional discriminations within a session. One group had prior training with MTS, while the other had prior training with NMTS. Over training, stimulus set size was progressively doubled from 3 to 6 to 12 stimuli to promote abstract concept development. Prior experience had an effect on the initial learning at each of the set sizes but by the end of training there were no group differences, as both groups showed similar within-session linear matching functions. After acquiring the 12-item set, abstract-concept learning was tested by placing novel stimuli at the beginning and end of a test session. Prior matching and non-matching experience affected transfer behavior. The matching experienced group transferred to novel stimuli in both the matching and non-matching portion of the sessions using a matching rule. The non-matching experienced group transferred to novel stimuli in both portions of the session using a non-matching rule. The representations used as the basis for mid-session reversal of the conditional discrimination behaviors and subsequent transfer behavior appears to have different temporal sources. The implications for the flexibility and organization of complex behaviors are considered.

  19. Object-Oriented Hierarchy Radiation Consistency for Different Temporal and Different Sensor Images

    Directory of Open Access Journals (Sweden)

    Nan Su

    2018-02-01

    Full Text Available In this paper, we propose a novel object-oriented hierarchy radiation consistency method for dense matching of different-temporal and different-sensor data in 3D reconstruction. For different-temporal images, our illumination consistency method is proposed to address both the illumination uniformity of a single image and the relative illumination normalization between image pairs. In the relative illumination normalization step in particular, singular value equalization and a linear relationship over the invariant pixels are used, respectively, for the initial global illumination normalization and for the refined, object-oriented illumination normalization. For different-sensor images, we propose the union group sparse method, which improves on the original group sparse model. The different-sensor images are set to a similar smoothness level by applying the same singular-value threshold from the union group matrix. Our method comprehensively considers the factors that influence dense matching of different-temporal and different-sensor stereoscopic image pairs, simultaneously improving illumination consistency and smoothness consistency. The radiation consistency experiments verify the effectiveness and superiority of the proposed method in comparison with two other methods. Moreover, in the dense matching experiment on mixed stereoscopic image pairs, our method shows clear advantages for objects in urban areas.

  20. Distinguishing between learning and motivation in behavioral tests of the reinforcement sensitivity theory of personality.

    Science.gov (United States)

    Smillie, Luke D; Dalgleish, Len I; Jackson, Chris J

    2007-04-01

    According to Gray's (1973) Reinforcement Sensitivity Theory (RST), a Behavioral Inhibition System (BIS) and a Behavioral Activation System (BAS) mediate effects of goal conflict and reward on behavior. BIS functioning has been linked with individual differences in trait anxiety and BAS functioning with individual differences in trait impulsivity. In this article, it is argued that behavioral outputs of the BIS and BAS can be distinguished in terms of learning and motivation processes and that these can be operationalized using the Signal Detection Theory measures of response-sensitivity and response-bias. In Experiment 1, two measures of BIS-reactivity predicted increased response-sensitivity under goal conflict, whereas one measure of BAS-reactivity predicted increased response-sensitivity under reward. In Experiment 2, two measures of BIS-reactivity predicted response-bias under goal conflict, whereas a measure of BAS-reactivity predicted motivation response-bias under reward. In both experiments, impulsivity measures did not predict criteria for BAS-reactivity as traditionally predicted by RST.
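
    The signal detection theory measures used above to operationalize response-sensitivity (d') and response-bias (criterion c) follow the textbook definitions; the sketch below uses standard formulas and made-up example counts, not data from the article.

```python
# Standard SDT measures from hit/false-alarm counts (textbook definitions).
from scipy.stats import norm

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)              # sensitivity
    criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))   # response bias
    return d_prime, criterion

# Made-up example: 40 hits, 10 misses, 15 false alarms, 35 correct rejections.
print(sdt_measures(40, 10, 15, 35))
```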

  1. The reward of seeing: Different types of visual reward and their ability to modify oculomotor learning.

    Science.gov (United States)

    Meermeier, Annegret; Gremmler, Svenja; Richert, Kerstin; Eckermann, Til; Lappe, Markus

    2017-10-01

    Saccadic adaptation is an oculomotor learning process that maintains the accuracy of eye movements to ensure effective perception of the environment. Although saccadic adaptation is commonly considered an automatic and low-level motor calibration in the cerebellum, we recently found that the strength of adaptation is influenced by the visual content of the target: pictures of humans produced stronger adaptation than noise stimuli. This suggests that meaningful images may be considered rewarding or valuable in oculomotor learning. Here we report three experiments that establish the boundaries of this effect. In the first, we tested whether stimuli that were associated with high and low value following long-term self-administered reinforcement learning produce stronger adaptation. Twenty-eight expert gamers participated in two sessions of adaptation to game-related high- and low-reward stimuli, but revealed no difference in saccadic adaptation (Bayes factor BF01 = 5.49). In the second experiment, we tested whether cognitive (literate) meaning could induce stronger adaptation by comparing targets consisting of words and nonwords. The results of twenty subjects revealed no difference in adaptation strength (Bayes factor BF01 = 3.21). The third experiment compared images of human figures to noise patterns for reactive saccades. Twenty-two subjects adapted significantly more toward images of human figures than toward noise. Together, these experiments suggest that primary (human figures vs. noise), but not secondary (words vs. nonwords, high- vs. low-value video game images), reinforcement affects saccadic adaptation.

  2. On the limits of statistical learning: Intertrial contextual cueing is confined to temporally close contingencies.

    Science.gov (United States)

    Thomas, Cyril; Didierjean, André; Maquestiaux, François; Goujon, Annabelle

    2018-04-12

    Since the seminal study by Chun and Jiang (Cognitive Psychology, 36, 28-71, 1998), a large body of research based on the contextual-cueing paradigm has shown that the cognitive system is capable of extracting statistical contingencies from visual environments. Most of these studies have focused on how individuals learn regularities found within an intratrial temporal window: A context predicts the target position within a given trial. However, Ono, Jiang, and Kawahara (Journal of Experimental Psychology, 31, 703-712, 2005) provided evidence of an intertrial implicit-learning effect when a distractor configuration in preceding trials N - 1 predicted the target location in trials N. The aim of the present study was to gain further insight into this effect by examining whether it occurs when predictive relationships are impeded by interfering task-relevant noise (Experiments 2 and 3) or by a long delay (Experiments 1, 4, and 5). Our results replicated the intertrial contextual-cueing effect, which occurred in the condition of temporally close contingencies. However, there was no evidence of integration across long-range spatiotemporal contingencies, suggesting a temporal limitation of statistical learning.

  3. Three Theories of Learning and Their Implications for Teachers.

    Science.gov (United States)

    Ramirez, Aura I.

    Currently, three theories of learning dominate classroom practice. First, B.F. Skinner's Theory of Operant Conditioning states that if behavior, including learning behavior, is reinforced, the probability of its being repeated increases strongly. Different types and schedules of reinforcement have been studied, by Skinner and others, and the…

  4. Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation.

    Science.gov (United States)

    Kong, Zehui; Zou, Yuan; Liu, Teng

    2017-01-01

    To further improve the fuel economy of series hybrid electric tracked vehicles, a reinforcement learning (RL)-based real-time energy management strategy is developed in this paper. In order to utilize the statistical characteristics of the online driving schedule effectively, a recursive algorithm for the transition probability matrix (TPM) of the power request is derived. RL is applied to calculate and update the control policy at regular intervals, adapting to the varying driving conditions. A forward-facing powertrain model is built in detail, including the engine-generator model, the battery model, and the vehicle dynamics model. The robustness and adaptability of the real-time energy management strategy are validated in simulation through comparison with a stationary control strategy based on an initial TPM generated from a long naturalistic driving cycle. Results indicate that the proposed method achieves better fuel economy than the stationary strategy and is more effective in real-time control.
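
    A hedged sketch of the two ingredients described above: an online (recursive) update of the power-request transition probability matrix from observed transitions, and a periodic policy recomputation from that matrix, here via a simple value iteration over an assumed cost table. The discretization, the cost structure, and the simplification that the power request evolves independently of the control action are illustrative assumptions, not the authors' implementation.

```python
# Illustrative recursive TPM estimation plus periodic policy recomputation.
import numpy as np

N_STATES = 8                                      # discretized power-request levels (assumed)
counts = np.ones((N_STATES, N_STATES))            # transition counts with a Laplace prior

def update_tpm(counts, prev_state, new_state):
    """Recursive TPM estimate: count the new transition and renormalize the rows."""
    counts[prev_state, new_state] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def recompute_policy(tpm, cost, gamma=0.95, iters=200):
    """Value iteration over an assumed cost table of shape (n_states, n_actions)."""
    V = np.zeros(cost.shape[0])
    for _ in range(iters):
        Q = cost + gamma * (tpm @ V)[:, None]     # expected next-state value per action
        V = Q.min(axis=1)
    return Q.argmin(axis=1)                       # minimum-cost action for each state
```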

  5. Implementation of real-time energy management strategy based on reinforcement learning for hybrid electric vehicles and simulation validation.

    Directory of Open Access Journals (Sweden)

    Zehui Kong

    Full Text Available To further improve the fuel economy of series hybrid electric tracked vehicles, a reinforcement learning (RL)-based real-time energy management strategy is developed in this paper. In order to utilize the statistical characteristics of the online driving schedule effectively, a recursive algorithm for the transition probability matrix (TPM) of the power request is derived. RL is applied to calculate and update the control policy at regular intervals, adapting to the varying driving conditions. A forward-facing powertrain model is built in detail, including the engine-generator model, the battery model, and the vehicle dynamics model. The robustness and adaptability of the real-time energy management strategy are validated in simulation through comparison with a stationary control strategy based on an initial TPM generated from a long naturalistic driving cycle. Results indicate that the proposed method achieves better fuel economy than the stationary strategy and is more effective in real-time control.

  6. General asymmetric neural networks and structure design by genetic algorithms: A learning rule for temporal patterns

    Energy Technology Data Exchange (ETDEWEB)

    Bornholdt, S. [Heidelberg Univ. (Germany), Inst. fuer Theoretische Physik]; Graudenz, D. [Lawrence Berkeley Lab., CA (United States)]

    1993-07-01

    A learning algorithm based on genetic algorithms for asymmetric neural networks with an arbitrary structure is presented. It is suited for the learning of temporal patterns and leads to stable neural networks with feedback.

  7. General asymmetric neural networks and structure design by genetic algorithms: A learning rule for temporal patterns

    International Nuclear Information System (INIS)

    Bornholdt, S.

    1993-07-01

    A learning algorithm based on genetic algorithms for asymmetric neural networks with an arbitrary structure is presented. It is suited for the learning of temporal patterns and leads to stable neural networks with feedback.

  8. Application of Reinforcement Learning in Cognitive Radio Networks: Models and Algorithms

    Directory of Open Access Journals (Sweden)

    Kok-Lim Alvin Yau

    2014-01-01

    Full Text Available Cognitive radio (CR) enables unlicensed users to exploit the underutilized spectrum in licensed spectrum whilst minimizing interference to licensed users. Reinforcement learning (RL), which is an artificial intelligence approach, has been applied to enable each unlicensed user to observe and carry out optimal actions for performance enhancement in a wide range of schemes in CR, such as dynamic channel selection and channel sensing. This paper presents new discussions of RL in the context of CR networks. It provides an extensive review on how most schemes have been approached using the traditional and enhanced RL algorithms through state, action, and reward representations. Examples of the enhancements on RL, which do not appear in the traditional RL approach, are rules and cooperative learning. This paper also reviews performance enhancements brought about by the RL algorithms and open issues. This paper aims to establish a foundation in order to spark new research interests in this area. Our discussion has been presented in a tutorial manner so that it is comprehensive to readers outside the specialty of RL and CR.
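
    As a concrete illustration of the state-action-reward framing reviewed above, the sketch below applies plain tabular Q-learning to dynamic channel selection. The channel-idle probabilities, reward definition (1 for an interference-free transmission), and state encoding are stand-in assumptions, not a scheme taken from the surveyed literature.

```python
# Toy tabular Q-learning for dynamic channel selection in a cognitive radio setting.
import random

N_CHANNELS = 5
P_IDLE = [0.2, 0.5, 0.9, 0.4, 0.7]       # assumed probability each channel is idle
Q = [[0.0] * N_CHANNELS for _ in range(N_CHANNELS)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

state = 0                                 # state = last channel used (assumption)
for t in range(20000):
    if random.random() < epsilon:         # epsilon-greedy exploration
        action = random.randrange(N_CHANNELS)
    else:
        action = max(range(N_CHANNELS), key=lambda a: Q[state][a])
    reward = 1.0 if random.random() < P_IDLE[action] else 0.0   # interference-free?
    next_state = action
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state
```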

  9. Behavioral sensitivity of temporally modulated striatal neurons

    Directory of Open Access Journals (Sweden)

    George ePortugal

    2011-07-01

    Full Text Available Recent investigations into the neural mechanisms that underlie temporal perception have revealed that the striatum is an important contributor to interval timing processes, and electrophysiological recording studies have shown that the firing rates of striatal neurons are modulated by the time in a trial at which an operant response is made. However, it remains unclear whether striatal firing rate modulations are related to the passage of time alone (i.e., whether temporal information is represented in an abstract manner independent of other attributes of biological importance), or whether this temporal information is embedded within striatal activity related to co-occurring contextual information, such as motor behaviors. This study evaluated these two hypotheses by recording from striatal neurons while rats performed a temporal production task. Rats were trained to respond at different nosepoke apertures for food reward under two simultaneously active reinforcement schedules: a variable-interval (VI 15-s) schedule and a fixed-interval (FI 15-s) schedule of reinforcement. Responding during a trial occurred in a sequential manner comprising three phases: VI responding, FI responding, VI responding. The vast majority of task-sensitive striatal neurons (95%) varied their firing rates associated with equivalent behaviors (e.g., periods in which their snout was held within the nosepoke) across these behavioral phases, and 96% of cells varied their firing rates for the same behavior within a phase, thereby demonstrating their sensitivity to time. However, in a direct test of the abstract timing hypothesis, 91% of temporally modulated hold cells were further modulated by the overt motor behaviors associated with transitioning between nosepokes. As such, these data are inconsistent with the striatum representing time in an 'abstract' manner, but support the hypothesis that temporal information is embedded within contextual and motor functions of the striatum.

  10. Do personality traits predict individual differences in excitatory and inhibitory learning?

    Directory of Open Access Journals (Sweden)

    Zhimin eHe

    2013-05-01

    Full Text Available Conditioned inhibition (CI) is demonstrated in classical conditioning when a stimulus is used to signal the omission of an otherwise expected outcome. This basic learning ability is involved in a wide range of normal behaviour - and thus its disruption could produce a correspondingly wide range of behavioural deficits. The present study employed a computer-based task to measure conditioned excitation and inhibition in the same discrimination procedure. Conditioned inhibition by summation test was clearly demonstrated. Additionally, summary measures of excitatory and inhibitory learning (difference scores) were calculated in order to explore how performance related to individual differences in a large sample of normal participants (n=176) following exclusion of those not meeting the basic learning criterion. The individual difference measures selected derive from two biologically-based personality theories, Gray's reinforcement sensitivity theory (1982) and Eysenck's psychoticism, extraversion and neuroticism theory (1991). Following the behavioural tasks, participants completed the behavioural inhibition system/behavioural activation system scales (BIS/BAS) and the Eysenck personality questionnaire revised short scale (EPQ-RS). Analyses of the relationship between scores on each of the scales and summary measures of excitatory and inhibitory learning suggested that those with higher BAS (specifically the drive sub-scale) and higher EPQ-RS neuroticism showed reduced levels of excitatory conditioning. Inhibitory conditioning was similarly attenuated in those with higher EPQ-RS neuroticism, as well as in those with higher BIS scores. Thus the findings are consistent with higher levels of neuroticism being accompanied by generally impaired associative learning, both inhibitory and excitatory. There was also evidence for some dissociation in the effects of behavioural activation and behavioural inhibition on excitatory and inhibitory learning respectively.

  11. Nature vs Nurture: Effects of Learning on Evolution

    Science.gov (United States)

    Nagrani, Nagina

    In the field of Evolutionary Robotics, the design, development, and application of artificial neural networks as controllers have derived their inspiration from biology. Biologists and artificial intelligence researchers are trying to understand, through qualitative and quantitative analyses, how learning during the lifetime of individuals affects the evolution of those individuals. The conclusions of these analyses can help develop optimized artificial neural networks for a given task. The purpose of this thesis is to study the effects of learning on evolution. This has been done by applying temporal difference reinforcement learning methods to the evolution of an Artificial Neural Tissue controller. The controller is assigned the task of collecting resources in a designated area of a simulated environment, and the performance of individuals is measured by the amount of resources collected. A comparison has been made between the results obtained by incorporating learning into evolution and by evolution alone. The effects on evolution of the learning parameters (learning rate, training period, discount rate, and policy) have also been studied. It was observed that learning delays the performance gains of the evolving individuals over the generations. However, the persistence of a non-zero learning rate throughout the evolution process indicates that natural selection prefers individuals possessing plasticity.

  12. Multiple concurrent temporal recalibrations driven by audiovisual stimuli with apparent physical differences.

    Science.gov (United States)

    Yuan, Xiangyong; Bi, Cuihua; Huang, Xiting

    2015-05-01

    Out-of-synchrony experiences can easily recalibrate one's subjective simultaneity point in the direction of the experienced asynchrony. Although temporal adjustment of multiple audiovisual stimuli has been recently demonstrated to be spatially specific, perceptual grouping processes that organize separate audiovisual stimuli into distinctive "objects" may play a more important role in forming the basis for subsequent multiple temporal recalibrations. We investigated whether apparent physical differences between audiovisual pairs that make them distinct from each other can independently drive multiple concurrent temporal recalibrations regardless of spatial overlap. Experiment 1 verified that reducing the physical difference between two audiovisual pairs diminishes the multiple temporal recalibrations by exposing observers to two utterances with opposing temporal relationships spoken by one single speaker rather than two distinct speakers at the same location. Experiment 2 found that increasing the physical difference between two stimuli pairs can promote multiple temporal recalibrations by complicating their non-temporal dimensions (e.g., disks composed of two rather than one attribute and tones generated by multiplying two frequencies); however, these recalibration aftereffects were subtle. Experiment 3 further revealed that making the two audiovisual pairs differ in temporal structures (one transient and one gradual) was sufficient to drive concurrent temporal recalibration. These results confirm that the more audiovisual pairs physically differ, especially in temporal profile, the more likely multiple temporal perception adjustments will be content-constrained regardless of spatial overlap. These results indicate that multiple temporal recalibrations are based secondarily on the outcome of perceptual grouping processes.

  13. Blended learning for reinforcing dental pharmacology in the clinical years: A qualitative analysis.

    Science.gov (United States)

    Eachempati, Prashanti; Kiran Kumar, K S; Sumanth, K N

    2016-10-01

    Blended learning has become the method of choice in educational institutions because of its systematic integration of traditional classroom teaching and online components. This study aims to analyze students' reflections regarding blended learning in dental pharmacology. A cross-sectional study was conducted in the Faculty of Dentistry, Melaka-Manipal Medical College among 3rd and 4th year BDS students. A total of 145 consenting dental students participated in the study. Students were divided into 14 groups. Nine online sessions followed by nine face-to-face discussions were held. Each session addressed topics related to oral lesions and orofacial pain with pharmacological applications. After each week, students were asked to reflect on blended learning. On completion of 9 weeks, reflections were collected and analyzed. Qualitative analysis was done using the thematic analysis model suggested by Braun and Clarke. Four main themes were identified, namely merits of blended learning, skill in writing prescriptions for oral diseases, dosages of drugs, and identification of strengths and weaknesses. In general, the participants gave positive feedback regarding blended learning. Students felt more confident in drug selection and prescription writing. They could recollect the doses better after the online and face-to-face sessions. Most interestingly, the students reflected that they were able to identify their strengths and weaknesses after the blended learning sessions. The blended learning module was successfully implemented for reinforcing dental pharmacology. The results obtained in this study enable us to plan future comparative studies on the effectiveness of blended learning in dental pharmacology.

  14. CAPES: Unsupervised Storage Performance Tuning Using Neural Network-Based Deep Reinforcement Learning

    CERN Multimedia

    CERN. Geneva

    2017-01-01

    Parameter tuning is an important task of storage performance optimization. Current practice usually involves numerous tweak-benchmark cycles that are slow and costly. To address this issue, we developed CAPES, a model-less deep reinforcement learning-based unsupervised parameter tuning system driven by a deep neural network (DNN). It is designed to find the optimal values of tunable parameters in computer systems, from a simple client-server system to a large data center, where human tuning can be costly and often cannot achieve optimal performance. CAPES takes periodic measurements of a target computer system's state, and trains a DNN which uses Q-learning to suggest changes to the system's current parameter values. CAPES is minimally intrusive, and can be deployed into a production system to collect training data and suggest tuning actions during the system's daily operation. Evaluation of a prototype on a Lustre system demonstrates an increase in I/O throughput of up to 45% at the saturation point.
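
    A greatly simplified, hypothetical sketch of the tuning loop described above: an agent periodically observes a performance metric, suggests an adjustment to one tunable parameter, and is rewarded by the resulting change in throughput. CAPES itself trains a deep Q-network; the tiny tabular agent and every name below are stand-ins meant only to illustrate the interaction pattern, not the system's API.

```python
# Sketch of a Q-learning tuning agent over one parameter (illustrative only).
import random

ACTIONS = [-1, 0, +1]                  # decrease / keep / increase the parameter
Q = {}                                 # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.2, 0.9, 0.1

def bucket(throughput):
    """Discretize the observed throughput into a coarse state."""
    return int(throughput // 50)

def choose(state):
    """Epsilon-greedy suggestion for the next parameter adjustment."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

def learn(state, action, reward, next_state):
    """Standard Q-learning update once the effect of the adjustment is measured."""
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
```

    In such a loop, choose would be consulted at each measurement interval and learn called once the change in throughput resulting from the adjustment has been observed.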

  15. Lung Nodule Detection via Deep Reinforcement Learning

    Directory of Open Access Journals (Sweden)

    Issa Ali

    2018-04-01

    Full Text Available Lung cancer is the most common cause of cancer-related death globally. As a preventive measure, the United States Preventive Services Task Force (USPSTF) recommends annual screening of high-risk individuals with low-dose computed tomography (CT). The resulting volume of CT scans from millions of people will pose a significant challenge for radiologists to interpret. To fill this gap, computer-aided detection (CAD) algorithms may prove to be the most promising solution. A crucial first step in the analysis of lung cancer screening results using CAD is the detection of pulmonary nodules, which may represent early-stage lung cancer. The objective of this work is to develop and validate a reinforcement learning model based on deep artificial neural networks for early detection of lung nodules in thoracic CT images. Inspired by the AlphaGo system, our deep learning algorithm takes a raw CT image as input, views it as a collection of states, and outputs a classification of whether a nodule is present or not. The dataset used to train our model is the LIDC/IDRI database hosted by the lung nodule analysis (LUNA) challenge. In total, there are 888 CT scans with annotations based on agreement from at least three out of four radiologists. As a result, there are 590 individuals having one or more nodules, and 298 having none. Our training results yielded an overall accuracy of 99.1% [sensitivity 99.2%, specificity 99.1%, positive predictive value (PPV) 99.1%, negative predictive value (NPV) 99.2%]. In our test, the results yielded an overall accuracy of 64.4% (sensitivity 58.9%, specificity 55.3%, PPV 54.2%, and NPV 60.0%). These early results show promise in solving the major issue of false positives in CT screening of lung nodules, and may help to save unnecessary follow-up tests and expenditures.

  16. Learning predictive statistics from temporal sequences: Dynamics and strategies.

    Science.gov (United States)

    Wang, Rui; Shen, Yuan; Tino, Peter; Welchman, Andrew E; Kourtzi, Zoe

    2017-10-01

    Human behavior is guided by our expectations about the future. Often, we make predictions by monitoring how event sequences unfold, even though such sequences may appear incomprehensible. Event structures in the natural environment typically vary in complexity, from simple repetition to complex probabilistic combinations. How do we learn these structures? Here we investigate the dynamics of structure learning by tracking human responses to temporal sequences that change in structure unbeknownst to the participants. Participants were asked to predict the upcoming item following a probabilistic sequence of symbols. Using a Markov process, we created a family of sequences, from simple frequency statistics (e.g., some symbols are more probable than others) to context-based statistics (e.g., symbol probability is contingent on preceding symbols). We demonstrate the dynamics with which individuals adapt to changes in the environment's statistics-that is, they extract the behaviorally relevant structures to make predictions about upcoming events. Further, we show that this structure learning relates to individual decision strategy; faster learning of complex structures relates to selection of the most probable outcome in a given context (maximizing) rather than matching of the exact sequence statistics. Our findings provide evidence for alternate routes to learning of behaviorally relevant statistics that facilitate our ability to predict future events in variable environments.

  17. Study on state grouping and opportunity evaluation for reinforcement learning methods; Kyoka gakushuho no tame no jotai grouping to opportunity hyoka ni kansuru kenkyu

    Energy Technology Data Exchange (ETDEWEB)

    Yu, W.; Yokoi, H.; Kakazu, Y. [Hokkaido University, Sapporo (Japan)

    1997-08-20

    In this paper, we propose the State Grouping scheme for coping with the problem of scaling up reinforcement learning algorithms to real, large-scale applications. The grouping scheme is based on geographical and trial-and-error information, and is made up of state generating, state combining, state splitting, and state forgetting procedures, together with corresponding action selection and learning modules. We also discuss the Labeling Based Evaluation scheme, which evaluates the opportunity of each state-action pair and can therefore use better experience to guide exploration of the state space effectively. Incorporating the Labeling Based Evaluation and State Grouping schemes into the reinforcement learning algorithm yields an approach that can generate an organized state space for reinforcement learning and solve the problem as well. We argue that this kind of ability is necessary for an autonomous agent: such an agent cannot act according to a pre-defined map; instead, it should explore the environment and find the optimal solution autonomously and simultaneously. By solving a large state-size 3-DOF, 4-link manipulator problem, we show the efficiency of the proposed approach, i.e., the agent can achieve an optimal or sub-optimal path with less memory and less time. 14 refs., 11 figs., 3 tabs.

  18. Braided reinforced composite rods for the internal reinforcement of concrete

    Science.gov (United States)

    Gonilho Pereira, C.; Fangueiro, R.; Jalali, S.; Araujo, M.; Marques, P.

    2008-05-01

    This paper reports on the development of braided reinforced composite rods as a substitute for the steel reinforcement in concrete. The research work aims at understanding the mechanical behaviour of core-reinforced braided fabrics and braided reinforced composite rods, namely concerning the influence of the braiding angle, the type of core reinforcement fibre, and preloading and postloading conditions. The core-reinforced braided fabrics were made from polyester fibres for producing braided structures, and E-glass, carbon, HT polyethylene, and sisal fibres were used for the core reinforcement. The braided reinforced composite rods were obtained by impregnating the core-reinforced braided fabric with a vinyl ester resin. The preloading of the core-reinforced braided fabrics and the postloading of the braided reinforced composite rods were performed in three and two stages, respectively. The results of tensile tests carried out on different samples of core-reinforced braided fabrics are presented and discussed. The tensile and bending properties of the braided reinforced composite rods have been evaluated, and the results obtained are presented, discussed, and compared with those of conventional materials, such as steel.

  19. The partial-reinforcement extinction effect and the contingent-sampling hypothesis.

    Science.gov (United States)

    Hochman, Guy; Erev, Ido

    2013-12-01

    The partial-reinforcement extinction effect (PREE) implies that learning under partial reinforcements is more robust than learning under full reinforcements. While the advantages of partial reinforcements have been well-documented in laboratory studies, field research has failed to support this prediction. In the present study, we aimed to clarify this pattern. Experiment 1 showed that partial reinforcements increase the tendency to select the promoted option during extinction; however, this effect is much smaller than the negative effect of partial reinforcements on the tendency to select the promoted option during the training phase. Experiment 2 demonstrated that the overall effect of partial reinforcements varies inversely with the attractiveness of the alternative to the promoted behavior: The overall effect is negative when the alternative is relatively attractive, and positive when the alternative is relatively unattractive. These results can be captured with a contingent-sampling model assuming that people select options that provided the best payoff in similar past experiences. The best fit was obtained under the assumption that similarity is defined by the sequence of the last four outcomes.
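
    A hedged sketch of a contingent-sampling chooser of the kind assumed by the best-fitting model above: the agent recalls past trials whose recent-outcome context (the sequence of the last four outcomes) matches the current one, and selects the option that paid best in those recalled trials. The data structures and tie-breaking choices are illustrative, not the authors' exact implementation.

```python
# Illustrative contingent-sampling choice rule keyed on the last four outcomes.
import random
from collections import defaultdict

def contingent_sampling_choice(history, context, options):
    """history: list of (context, option, payoff); context: tuple of last 4 outcomes."""
    payoffs = defaultdict(list)
    for past_context, option, payoff in history:
        if past_context == context:              # a "similar past experience"
            payoffs[option].append(payoff)
    if not payoffs:                              # nothing similar yet: guess
        return random.choice(options)
    def mean_payoff(option):
        vals = payoffs[option]
        return sum(vals) / len(vals) if vals else float("-inf")
    return max(options, key=mean_payoff)
```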

  20. Deep Reinforcement Fuzzing

    OpenAIRE

    Böttinger, Konstantin; Godefroid, Patrice; Singh, Rishabh

    2018-01-01

    Fuzzing is the process of finding security vulnerabilities in input-processing code by repeatedly testing the code with modified inputs. In this paper, we formalize fuzzing as a reinforcement learning problem using the concept of Markov decision processes. This in turn allows us to apply state-of-the-art deep Q-learning algorithms that optimize rewards, which we define from runtime properties of the program under test. By observing the rewards caused by mutating with a specific set of actions...
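
    A toy, hedged sketch of the fuzzing-as-reinforcement-learning framing: states summarize the current input, actions are mutation operators, and the reward stands in for a runtime property such as newly covered code. The paper uses deep Q-learning over program inputs; the tabular agent, mutation set, and fake coverage signal below are purely illustrative.

```python
# Toy Q-learning fuzzer: pick mutation operators that tend to earn reward.
import random

MUTATIONS = {
    "flip_byte": lambda b: (b[:1] + bytes([b[1] ^ 0xFF]) + b[2:]) if len(b) > 2 else b,
    "append_byte": lambda b: b + bytes([random.randrange(256)]),
    "drop_byte": lambda b: b[:-1] if len(b) > 1 else b,
}
Q = {}
alpha, gamma, eps = 0.3, 0.8, 0.2

def state_of(data):
    """Crude summary of the current input (stand-in for a learned state)."""
    return (len(data) % 8, data[0] % 8 if data else 0)

def coverage_reward(data):
    """Stand-in for a runtime property of the program under test."""
    return random.random()

seed = b"hello"
s = state_of(seed)
for step in range(1000):
    a = (random.choice(list(MUTATIONS)) if random.random() < eps
         else max(MUTATIONS, key=lambda m: Q.get((s, m), 0.0)))
    seed = MUTATIONS[a](seed)
    r = coverage_reward(seed)
    s2 = state_of(seed)
    best = max(Q.get((s2, m), 0.0) for m in MUTATIONS)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))
    s = s2
```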

  1. Understanding dopamine and reinforcement learning: the dopamine reward prediction error hypothesis.

    Science.gov (United States)

    Glimcher, Paul W

    2011-09-13

    A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

  2. Enabling an Integrated Rate-temporal Learning Scheme on Memristor

    Science.gov (United States)

    He, Wei; Huang, Kejie; Ning, Ning; Ramanathan, Kiruthika; Li, Guoqi; Jiang, Yu; Sze, Jiayin; Shi, Luping; Zhao, Rong; Pei, Jing

    2014-04-01

    The learning scheme is the key to the utilization of spike-based computation and the emulation of neural/synaptic behaviors toward the realization of cognition. Biological observations reveal an integrated spike-time- and spike-rate-dependent plasticity as a function of presynaptic firing frequency. However, this integrated rate-temporal learning scheme has not been realized on any nanodevice. In this paper, such a scheme is successfully demonstrated on a memristor. Great robustness against spiking-rate fluctuation is achieved by waveform engineering with the aid of the good analog properties exhibited by the iron oxide-based memristor. Spike-timing-dependent plasticity (STDP) occurs at moderate presynaptic firing frequencies, and spike-rate-dependent plasticity (SRDP) dominates in other regions. This demonstration provides a novel approach to neural coding implementation, which facilitates the development of bio-inspired computing systems.

  3. Temporal sequence learning in winner-take-all networks of spiking neurons demonstrated in a brain-based device.

    Science.gov (United States)

    McKinstry, Jeffrey L; Edelman, Gerald M

    2013-01-01

    Animal behavior often involves a temporally ordered sequence of actions learned from experience. Here we describe simulations of interconnected networks of spiking neurons that learn to generate patterns of activity in correct temporal order. The simulation consists of large-scale networks of thousands of excitatory and inhibitory neurons that exhibit short-term synaptic plasticity and spike-timing dependent synaptic plasticity. The neural architecture within each area is arranged to evoke winner-take-all (WTA) patterns of neural activity that persist for tens of milliseconds. In order to generate and switch between consecutive firing patterns in correct temporal order, a reentrant exchange of signals between these areas was necessary. To demonstrate the capacity of this arrangement, we used the simulation to train a brain-based device responding to visual input by autonomously generating temporal sequences of motor actions.

  4. Differences between appetitive and aversive reinforcement on reorientation in a spatial working memory task.

    Science.gov (United States)

    Golob, Edward J; Taube, Jeffrey S

    2002-10-17

    Tasks using appetitive reinforcers show that following disorientation rats use the shape of an arena to reorient, and cannot distinguish two geometrically similar corners to obtain a reward, despite the presence of a prominent visual cue that provides information to differentiate the two corners. Other studies show that disorientation impairs performance on certain appetitive, but not aversive, tasks. This study evaluated whether rats would make similar geometric errors in a working memory task that used aversive reinforcement. We hypothesized that in a task that used aversive reinforcement rats that were initially disoriented would not reorient by arena shape and thus make similar geometric errors. Tests were performed in a rectangular arena having one polarizing cue. In the appetitive condition water consumption was the reward. The aversive condition was a water maze task with reinforcement provided by escape to a hidden platform. In the aversive condition rats returned to the reinforced corner significantly more often than in the dry condition, and did not favor the diagonally opposite corner. Results show that rats can use cues besides arena shape to reorient in an aversive reinforcement condition. These findings may also reflect different strategies, with an escape/homing strategy in the wet condition and a foraging strategy in the dry condition.

  5. Reinforcement Learning for Predictive Analytics in Smart Cities

    Directory of Open Access Journals (Sweden)

    Kostas Kolomvatsos

    2017-06-01

    Full Text Available The digitization of our lives causes a shift in data production as well as in the required data management. Numerous nodes are capable of producing huge volumes of data in our everyday activities. Sensors, personal smart devices, and the Internet of Things (IoT) paradigm lead to a vast infrastructure that covers all aspects of activities in modern societies. In most cases, the critical issue for public authorities (usually local ones, like municipalities) is the efficient management of data towards the support of novel services. The reason is that analytics provided on top of the collected data could help in the delivery of new applications that will facilitate citizens' lives. However, the provision of analytics demands intelligent techniques for the underlying data management. The best-known technique is the separation of huge volumes of data into a number of parts and their parallel management to limit the time required for the delivery of analytics. Afterwards, analytics requests in the form of queries can be realized to derive the necessary knowledge for supporting intelligent applications. In this paper, we define the concept of a Query Controller (QC) that receives queries for analytics and assigns each of them to a processor placed in front of each data partition. We discuss an intelligent process for query assignment that adopts Machine Learning (ML). We adopt two learning schemes, i.e., Reinforcement Learning (RL) and clustering. We report on the comparison of the two schemes and elaborate on their combination. Our aim is to provide an efficient framework to support the decision making of the QC, which should swiftly select the appropriate processor for each query. We provide mathematical formulations for the discussed problem and present simulation results. Through a comprehensive experimental evaluation, we reveal the advantages of the proposed models and describe the outcomes of their comparison.
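
    A hedged sketch of the Query Controller's assignment decision framed as a simple learning problem: incoming queries are grouped into classes, each class-processor pair keeps a value estimate, and the controller updates that estimate from the observed response time. The bandit-style update, the query classes, and the reward definition (negative latency) are illustrative assumptions, not the framework proposed in the paper.

```python
# Illustrative epsilon-greedy query-to-processor assignment with latency feedback.
import random

N_PROCESSORS = 4
Q = {}                                  # Q[(query_class, processor)] -> value
alpha, epsilon = 0.2, 0.1

def assign(query_class):
    """Pick a processor for a query of the given class."""
    if random.random() < epsilon:
        return random.randrange(N_PROCESSORS)
    return max(range(N_PROCESSORS), key=lambda p: Q.get((query_class, p), 0.0))

def feedback(query_class, processor, latency_seconds):
    """Update the estimate once the query's response time is known."""
    reward = -latency_seconds
    old = Q.get((query_class, processor), 0.0)
    Q[(query_class, processor)] = old + alpha * (reward - old)
```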

  6. Contributions of Medial Temporal Lobe and Striatal Memory Systems to Learning and Retrieving Overlapping Spatial Memories

    Science.gov (United States)

    Brown, Thackery I.; Stern, Chantal E.

    2014-01-01

    Many life experiences share information with other memories. In order to make decisions based on overlapping memories, we need to distinguish between experiences to determine the appropriate behavior for the current situation. Previous work suggests that the medial temporal lobe (MTL) and medial caudate interact to support the retrieval of overlapping navigational memories in different contexts. The present study used functional magnetic resonance imaging (fMRI) in humans to test the prediction that the MTL and medial caudate play complementary roles in learning novel mazes that cross paths with, and must be distinguished from, previously learned routes. During fMRI scanning, participants navigated virtual routes that were well learned from prior training while also learning new mazes. Critically, some routes learned during scanning shared hallways with those learned during pre-scan training. Overlap between mazes required participants to use contextual cues to select between alternative behaviors. Results demonstrated parahippocampal cortex activity specific for novel spatial cues that distinguish between overlapping routes. The hippocampus and medial caudate were active for learning overlapping spatial memories, and increased their activity for previously learned routes when they became context dependent. Our findings provide novel evidence that the MTL and medial caudate play complementary roles in the learning, updating, and execution of context-dependent navigational behaviors. PMID:23448868

  7. Intrinsic interactive reinforcement learning - Using error-related potentials for real world human-robot interaction.

    Science.gov (United States)

    Kim, Su Kyoung; Kirchner, Elsa Andrea; Stefes, Arne; Kirchner, Frank

    2017-12-14

    Reinforcement learning (RL) enables robots to learn their optimal behavioral strategies in dynamic environments based on feedback. Explicit human feedback during robot RL is advantageous, since an explicit reward function can be easily adapted. However, it is very demanding and tiresome for a human to continuously and explicitly generate feedback. Therefore, the development of implicit approaches is of high relevance. In this paper, we used an error-related potential (ErrP), an event-related activity in the human electroencephalogram (EEG), as intrinsically generated implicit feedback (rewards) for RL. Initially we validated our approach with seven subjects in a simulated robot learning scenario. ErrPs were detected online in single trials with a balanced accuracy (bACC) of 91%, which was sufficient to learn to recognize gestures and the correct mapping between human gestures and robot actions in parallel. Finally, we validated our approach in a real robot scenario, in which seven subjects freely chose gestures and the real robot correctly learned the mapping between gestures and actions (ErrP detection: 90% bACC). In this paper, we demonstrated that intrinsically generated EEG-based human feedback in RL can successfully be used to implicitly improve gesture-based robot control during human-robot interaction. We call our approach intrinsic interactive RL.
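
    The sketch below is a minimal, assumed illustration (not the authors' implementation) of how a binary EEG-derived error signal could act as an implicit reward: a detected ErrP counts as negative feedback for the gesture-to-action mapping, its absence as positive feedback. Gesture and action names are invented for the example.

```python
# Minimal sketch: learning a gesture-to-action mapping from an implicit,
# EEG-derived error signal (detected ErrP = negative reward).
import random

gestures = ["wave", "point", "stop"]          # illustrative gesture labels
actions = ["approach", "hand_over", "halt"]   # illustrative robot actions
q = {(g, a): 0.0 for g in gestures for a in actions}
alpha, epsilon = 0.3, 0.2

def choose_action(gesture):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[(gesture, a)])

def update(gesture, action, errp_detected):
    reward = -1.0 if errp_detected else 1.0   # implicit human feedback
    q[(gesture, action)] += alpha * (reward - q[(gesture, action)])

g = "wave"
a = choose_action(g)
update(g, a, errp_detected=False)   # EEG classifier reported no error potential
```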

  8. Randomized controlled trial evaluating the temporal effects of high-intensity exercise on learning, short-term and long-term memory, and prospective memory.

    Science.gov (United States)

    Frith, Emily; Sng, Eveleen; Loprinzi, Paul D

    2017-11-01

    The broader purpose of this study was to examine the temporal effects of high-intensity exercise on learning, short-term and long-term retrospective memory, and prospective memory. A sample of 88 young adult participants was randomized, 22 per group, into one of four groups: exercise before learning, control group, exercise during learning, and exercise after learning. The retrospective outcomes (learning, short-term and long-term memory) were assessed using the Rey Auditory Verbal Learning Test. Long-term memory included a 20-min and a 24-hr follow-up assessment. Prospective memory was assessed using a time-based procedure by having participants contact (via phone) the researchers at a follow-up time period. The exercise stimulus was a 15-min bout of progressive maximal exertion treadmill exercise. High-intensity exercise prior to memory encoding (vs. exercise during memory encoding or consolidation) was effective in enhancing long-term memory (for both the 20-min and 24-hr follow-up assessments). We did not observe a differential temporal effect of high-intensity exercise on short-term memory (immediate post-memory encoding), learning or prospective memory. The timing of high-intensity exercise may play an important role in facilitating long-term memory. © 2017 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.

  9. A psychophysical investigation of differences between synchrony and temporal order judgments.

    Science.gov (United States)

    Love, Scott A; Petrini, Karin; Cheng, Adam; Pollick, Frank E

    2013-01-01

    Synchrony judgments involve deciding whether cues to an event are in synch or out of synch, while temporal order judgments involve deciding which of the cues came first. When the cues come from different sensory modalities these judgments can be used to investigate multisensory integration in the temporal domain. However, evidence indicates that these two tasks should not be used interchangeably as it is unlikely that they measure the same perceptual mechanism. The current experiment further explores this issue across a variety of different audiovisual stimulus types. Participants were presented with 5 audiovisual stimulus types, each at 11 parametrically manipulated levels of cue asynchrony. During separate blocks, participants had to make synchrony judgments or temporal order judgments. For some stimulus types many participants were unable to successfully make temporal order judgments, but they were able to make synchrony judgments. The mean points of subjective simultaneity for synchrony judgments were all video-leading, while those for temporal order judgments were all audio-leading. In the within participants analyses no correlation was found across the two tasks for either the point of subjective simultaneity or the temporal integration window. Stimulus type influenced how the two tasks differed; nevertheless, consistent differences were found between the two tasks regardless of stimulus type. Therefore, in line with previous work, we conclude that synchrony and temporal order judgments are supported by different perceptual mechanisms and should not be interpreted as being representative of the same perceptual process.

  10. Linking Individual Learning Styles to Approach-Avoidance Motivational Traits and Computational Aspects of Reinforcement Learning.

    Directory of Open Access Journals (Sweden)

    Kristoffer Carl Aberg

    Full Text Available Learning how to gain rewards (approach learning) and avoid punishments (avoidance learning) is fundamental for everyday life. While individual differences in approach and avoidance learning styles have been related to genetics and aging, the contribution of personality factors, such as traits, remains undetermined. Moreover, little is known about the computational mechanisms mediating differences in learning styles. Here, we used a probabilistic selection task with positive and negative feedbacks, in combination with computational modelling, to show that individuals displaying better approach (vs. avoidance) learning scored higher on measures of approach (vs. avoidance) trait motivation, but, paradoxically, also displayed reduced learning speed following positive (vs. negative) outcomes. These data suggest that learning different types of information depends on associated reward values and internal motivational drives, possibly determined by personality traits.

  11. Linking Individual Learning Styles to Approach-Avoidance Motivational Traits and Computational Aspects of Reinforcement Learning

    Science.gov (United States)

    Carl Aberg, Kristoffer; Doell, Kimberly C.; Schwartz, Sophie

    2016-01-01

    Learning how to gain rewards (approach learning) and avoid punishments (avoidance learning) is fundamental for everyday life. While individual differences in approach and avoidance learning styles have been related to genetics and aging, the contribution of personality factors, such as traits, remains undetermined. Moreover, little is known about the computational mechanisms mediating differences in learning styles. Here, we used a probabilistic selection task with positive and negative feedbacks, in combination with computational modelling, to show that individuals displaying better approach (vs. avoidance) learning scored higher on measures of approach (vs. avoidance) trait motivation, but, paradoxically, also displayed reduced learning speed following positive (vs. negative) outcomes. These data suggest that learning different types of information depends on associated reward values and internal motivational drives, possibly determined by personality traits. PMID:27851807
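
    A common way to capture such asymmetries computationally is a dual-learning-rate value update, with separate rates for better- and worse-than-expected outcomes. The sketch below is a generic illustration of that model class; the variable names and parameter values are illustrative assumptions, not the authors' fitted model.

```python
# Generic dual-learning-rate model for a probabilistic selection task:
# positive and negative prediction errors are scaled by different rates.
def update_value(value, reward, alpha_pos, alpha_neg):
    delta = reward - value                        # prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    return value + alpha * delta

v = 0.5
# an individual with faster avoidance learning (alpha_neg > alpha_pos)
for outcome in [1, 0, 0, 1, 0]:
    v = update_value(v, outcome, alpha_pos=0.1, alpha_neg=0.3)
```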

  12. Off-policy reinforcement learning for H∞ control design.

    Science.gov (United States)

    Luo, Biao; Wu, Huai-Ning; Huang, Tingwen

    2015-01-01

    The H∞ control design problem is considered for nonlinear systems with an unknown internal system model. It is known that the nonlinear H∞ control problem can be transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation, a nonlinear partial differential equation that is generally impossible to solve analytically. Even worse, model-based approaches cannot be used to approximately solve the HJI equation when an accurate system model is unavailable or costly to obtain in practice. To overcome these difficulties, an off-policy reinforcement learning (RL) method is introduced to learn the solution of the HJI equation from real system data instead of a mathematical system model, and its convergence is proved. In the off-policy RL method, the system data can be generated with arbitrary policies rather than the evaluating policy, which is extremely important and promising for practical systems. For implementation purposes, a neural network (NN)-based actor-critic structure is employed and a least-square NN weight update algorithm is derived based on the method of weighted residuals. Finally, the developed NN-based off-policy RL method is tested on a linear F16 aircraft plant, and further applied to a rotational/translational actuator system.
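
    As a loose illustration of the least-squares flavor of the critic weight update (the actual HJI residual terms and the actor update derived in the paper are omitted), the sketch below fits critic weights to sampled data by batch least squares; all variable names and the toy data are placeholders.

```python
# Illustrative sketch only: batch least-squares fit of critic weights on
# collected system data, in the spirit of a weighted-residual update.
import numpy as np

def least_squares_critic(phi, targets):
    """phi: (N, k) critic feature matrix on sampled states;
    targets: (N,) regression targets built from the off-policy data.
    Returns critic weights minimizing the squared residual."""
    w, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return w

# toy data: 100 samples, 5 basis functions
phi = np.random.randn(100, 5)
targets = phi @ np.array([1.0, -0.5, 0.2, 0.0, 0.3]) + 0.01 * np.random.randn(100)
w = least_squares_critic(phi, targets)
```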

  13. Parental Positive Reinforcement with Deviant Children: Does It Make a Difference?

    Science.gov (United States)

    Forehand, Rex

    1986-01-01

    Considers effectiveness of parental positive reinforcement with deviant children by reviewing the following: (1) non-intervention studies, (2) intervention studies, and (3) consumer (parental) satisfaction studies. Results indicate that parents view positive reinforcement as effective and useful although positive reinforcement is not sufficient to…

  14. Study and Application of Reinforcement Learning in Cooperative Strategy of the Robot Soccer Based on BDI Model

    Directory of Open Access Journals (Sweden)

    Wu Bo-ying

    2009-11-01

    Full Text Available A dynamic multi-Agent cooperation model is formed by combining reinforcement learning with the BDI model. In this model, the concept of individual optimization loses its meaning, because the payoff of each Agent depends not only on itself but also on the choices of the other Agents. All Agents pursue a common optimal solution and try to realize the united intention as a whole to the maximum extent. The robot moves to its goal depending on the present positions of the other robots that cooperate with it and the present position of the ball. One of these cooperating robots is controlled to move by a human with a joystick. In this way, each Agent is ensured to search each state-action pair as frequently as possible when choosing movements, shortening the search of the movement space and improving the convergence speed of reinforcement learning. The validity of the proposed cooperative strategy for robot soccer has been demonstrated by combining theoretical analysis with a simulated robot soccer match (11 vs 11).

  15. Lack of effect of Pitressin on the learning ability of Brattleboro rats with diabetes insipidus using positively reinforced operant conditioning.

    Science.gov (United States)

    Laycock, J F; Gartside, I B

    1985-08-01

    Brattleboro rats with hereditary hypothalamic diabetes insipidus (BDI) received daily subcutaneous injections of vasopressin in the form of Pitressin tannate (0.5 IU/24 hr). They were initially deprived of food and then trained to work for food reward in a Skinner box to a fixed ratio of ten presses for each pellet received. Once this schedule had been learned the rats were given a discrimination task daily for seven days. The performances of these BDI rats were compared with those of rats of the parent Long Evans (LE) strain receiving daily subcutaneous injections of vehicle (arachis oil). Comparisons were also made between these two groups of treated animals and untreated BDI and LE rats studied under similar conditions. In the initial learning trial, both control and Pitressin-treated BDI rats performed significantly better, and manifested less fear initially, than the control or vehicle-injected LE rats when first placed in the Skinner box. Once the initial task had been learned there was no marked difference in the discrimination learning between control or treated BDI and LE animals. These results support the view that vasopressin is not directly involved in all types of learning behaviour, particularly those involving positively reinforced operant conditioning.

  16. Temporal contingency.

    Science.gov (United States)

    Gallistel, C R; Craig, Andrew R; Shahan, Timothy A

    2014-01-01

    Contingency, and more particularly temporal contingency, has often figured in thinking about the nature of learning. However, it has never been formally defined in such a way as to make it a measure that can be applied to most animal learning protocols. We use elementary information theory to define contingency in such a way as to make it a measurable property of almost any conditioning protocol. We discuss how making it a measurable construct enables the exploration of the role of different contingencies in the acquisition and performance of classically and operantly conditioned behavior. Copyright © 2013 Elsevier B.V. All rights reserved.

  17. Temporal contingency

    Science.gov (United States)

    Gallistel, C.R.; Craig, Andrew R.; Shahan, Timothy A.

    2015-01-01

    Contingency, and more particularly temporal contingency, has often figured in thinking about the nature of learning. However, it has never been formally defined in such a way as to make it a measure that can be applied to most animal learning protocols. We use elementary information theory to define contingency in such a way as to make it a measurable property of almost any conditioning protocol. We discuss how making it a measurable construct enables the exploration of the role of different contingencies in the acquisition and performance of classically and operantly conditioned behavior. PMID:23994260

  18. Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints

    Science.gov (United States)

    Yang, Xiong; Liu, Derong; Wang, Ding

    2014-03-01

    In this paper, an adaptive reinforcement learning-based solution is developed for the infinite-horizon optimal control problem of constrained-input continuous-time nonlinear systems in the presence of nonlinearities with unknown structures. Two different types of neural networks (NNs) are employed to approximate the Hamilton-Jacobi-Bellman equation. That is, a recurrent NN is constructed to identify the unknown dynamical system, and two feedforward NNs are used as the actor and the critic to approximate the optimal control and the optimal cost, respectively. Based on this framework, the action NN and the critic NN are tuned simultaneously, without the requirement for the knowledge of system drift dynamics. Moreover, by using Lyapunov's direct method, the weights of the action NN and the critic NN are guaranteed to be uniformly ultimately bounded, while keeping the closed-loop system stable. To demonstrate the effectiveness of the present approach, simulation results are illustrated.

  19. Observational fear learning in degus is correlated with temporal vocalization patterns.

    Science.gov (United States)

    Lidhar, Navdeep K; Insel, Nathan; Dong, June Yue; Takehara-Nishiuchi, Kaori

    2017-08-14

    Some animals learn to fear a situation after observing another individual come to harm, and this learning is influenced by the animals' social relationship and history. An important but sometimes overlooked factor in studies of observational fear learning is that social context not only affects observers, but may also influence the behavior and communications expressed by those being observed. Here we sought to investigate whether observational fear learning in the degu (Octodon degus) is affected by social familiarity, and the degree to which vocal expressions of alarm or distress contribute. 'Demonstrator' degus underwent contextual fear conditioning in the presence of a cagemate or stranger observer. Among the 15 male pairs, observers of familiar demonstrators exhibited higher freezing rates than observers of strangers when returned to the conditioning environment one day later. Observer freezing during testing was, however, also related to the proportion of short- versus long- inter-call-intervals (ICIs) in vocalizations recorded during prior conditioning. In a regression model that included both social relationship and ICI patterns, only the latter was significant. Further investigation of vocalizations, including use of a novel, directed k-means clustering approach, suggested that temporal structure rather than tonal variations may have been responsible for communicating danger. These data offer insight into how different expressions of distress or fear may impact an observer, adding to the complexity of social context effects in studies of empathy and social cognition. The experiments also offer new data on degu alarm calls and a potentially novel methodological approach to complex vocalizations. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. A Psychophysical Investigation of Differences between Synchrony and Temporal Order Judgments

    Science.gov (United States)

    Love, Scott A.; Petrini, Karin; Cheng, Adam; Pollick, Frank E.

    2013-01-01

    Background Synchrony judgments involve deciding whether cues to an event are in synch or out of synch, while temporal order judgments involve deciding which of the cues came first. When the cues come from different sensory modalities these judgments can be used to investigate multisensory integration in the temporal domain. However, evidence indicates that these two tasks should not be used interchangeably as it is unlikely that they measure the same perceptual mechanism. The current experiment further explores this issue across a variety of different audiovisual stimulus types. Methodology/Principal Findings Participants were presented with 5 audiovisual stimulus types, each at 11 parametrically manipulated levels of cue asynchrony. During separate blocks, participants had to make synchrony judgments or temporal order judgments. For some stimulus types many participants were unable to successfully make temporal order judgments, but they were able to make synchrony judgments. The mean points of subjective simultaneity for synchrony judgments were all video-leading, while those for temporal order judgments were all audio-leading. In the within participants analyses no correlation was found across the two tasks for either the point of subjective simultaneity or the temporal integration window. Conclusions Stimulus type influenced how the two tasks differed; nevertheless, consistent differences were found between the two tasks regardless of stimulus type. Therefore, in line with previous work, we conclude that synchrony and temporal order judgments are supported by different perceptual mechanisms and should not be interpreted as being representative of the same perceptual process. PMID:23349971

  1. A psychophysical investigation of differences between synchrony and temporal order judgments.

    Directory of Open Access Journals (Sweden)

    Scott A Love

    Full Text Available BACKGROUND: Synchrony judgments involve deciding whether cues to an event are in synch or out of synch, while temporal order judgments involve deciding which of the cues came first. When the cues come from different sensory modalities these judgments can be used to investigate multisensory integration in the temporal domain. However, evidence indicates that these two tasks should not be used interchangeably as it is unlikely that they measure the same perceptual mechanism. The current experiment further explores this issue across a variety of different audiovisual stimulus types. METHODOLOGY/PRINCIPAL FINDINGS: Participants were presented with 5 audiovisual stimulus types, each at 11 parametrically manipulated levels of cue asynchrony. During separate blocks, participants had to make synchrony judgments or temporal order judgments. For some stimulus types many participants were unable to successfully make temporal order judgments, but they were able to make synchrony judgments. The mean points of subjective simultaneity for synchrony judgments were all video-leading, while those for temporal order judgments were all audio-leading. In the within participants analyses no correlation was found across the two tasks for either the point of subjective simultaneity or the temporal integration window. CONCLUSIONS: Stimulus type influenced how the two tasks differed; nevertheless, consistent differences were found between the two tasks regardless of stimulus type. Therefore, in line with previous work, we conclude that synchrony and temporal order judgments are supported by different perceptual mechanisms and should not be interpreted as being representative of the same perceptual process.

  2. Striatal and Tegmental Neurons Code Critical Signals for Temporal-Difference Learning of State Value in Domestic Chicks

    Directory of Open Access Journals (Sweden)

    Chentao Wen

    2016-11-01

    Full Text Available To ensure survival, animals must update the internal representations of their environment in a trial-and-error fashion. Psychological studies of associative learning and neurophysiological analyses of dopaminergic neurons have suggested that this updating process involves the temporal-difference (TD) method in the basal ganglia network. However, the way in which the component variables of the TD method are implemented at the neuronal level is unclear. To investigate the underlying neural mechanisms, we trained domestic chicks to associate color cues with food rewards. We recorded neuronal activities from the medial striatum or tegmentum in a freely behaving condition and examined how reward omission changed neuronal firing. To compare neuronal activities with the signals assumed in the TD method, we simulated the behavioral task in the form of a finite sequence composed of discrete time steps. The three signals assumed in the simulated task were the prediction signal, the target signal for updating, and the TD-error signal. In both the medial striatum and tegmentum, the majority of recorded neurons were categorized into three types according to their fit to these three models, though the neurons tended to form a continuum without distinct differences in firing rate. Specifically, two types of striatal neurons successfully mimicked the target signal and the prediction signal. A linear summation of these two types of striatal neurons was a good fit for the activity of one type of tegmental neurons mimicking the TD-error signal. The present study thus demonstrates that the striatum and tegmentum can convey the signals critically required for the TD method. Based on the theoretical and neurophysiological studies, together with tract-tracing data, we propose a novel model to explain how the convergence of signals represented in the striatum could lead to the computation of TD error in tegmental dopaminergic neurons.
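
    For concreteness, here is a minimal tabular TD(0) sketch showing the three signals named above (the prediction, the update target, and the TD error). The two-step cue-to-reward task structure and the parameter values are illustrative, not taken from the study.

```python
# Minimal tabular TD(0) value learning over a toy cue -> delay -> reward task.
gamma, alpha = 0.9, 0.1
V = {"cue": 0.0, "delay": 0.0, "reward": 0.0}

def td_step(s, s_next, r):
    prediction = V[s]                   # prediction signal
    target = r + gamma * V[s_next]      # target signal for updating
    td_error = target - prediction      # TD-error signal
    V[s] += alpha * td_error
    return prediction, target, td_error

# one trial: reward (r = 1) delivered on the final transition
for (s, s_next, r) in [("cue", "delay", 0.0), ("delay", "reward", 1.0)]:
    td_step(s, s_next, r)
```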

  3. Felder-Soloman's Index of Learning Styles: internal consistency, temporal stability, and factor structure.

    Science.gov (United States)

    Hosford, Charles C; Siders, William A

    2010-10-01

    Strategies to facilitate learning include using knowledge of students' learning style preferences to inform students and their teachers. Aims of this study were to evaluate the factor structure, internal consistency, and temporal stability of medical student responses to the Index of Learning Styles (ILS) and determine its appropriateness as an instrument for medical education. The ILS assesses preferences on four dimensions: sensing/intuitive information perceiving, visual/verbal information receiving, active/reflective information processing, and sequential/global information understanding. Students entering the 2002-2007 classes completed the ILS; some completed the ILS again after 2 and 4 years. Analyses of responses supported the ILS's intended structure and moderate reliability. Students had moderate preferences for sensing and visual learning. This study provides evidence supporting the appropriateness of the ILS for assessing learning style preferences in medical students.

  4. Reading a Story: Different Degrees of Learning in Different Learning Environments

    Directory of Open Access Journals (Sweden)

    Anna Maria Giannini

    2017-10-01

    Full Text Available The learning environment in which material is acquired may produce differences in delayed recall and in the elements that individuals focus on. These differences may appear even during development. In the present study, we compared three different learning environments in 450 normally developing 7-year-old children subdivided into three groups according to the type of learning environment. Specifically, children were asked to learn the same material shown in three different learning environments: reading illustrated books (TB); interacting with the same text displayed on a PC monitor and enriched with interactive activities (PC-IA); reading the same text on a PC monitor but not enriched with interactive narratives (PC-NoIA). Our results demonstrated that TB and PC-NoIA elicited better verbal memory recall. In contrast, PC-IA and PC-NoIA produced higher scores for visuo-spatial memory, enhancing memory for spatial relations, positions and colors with respect to TB. Interestingly, only TB seemed to produce a deeper comprehension of the story’s moral. Our results indicated that PC-IA offered a different type of learning that favored visual details. In this sense, interactive activities demonstrate certain limitations, probably due to information overabundance, emotional mobilization, emphasis on images and effort exerted in interactive activities. Thus, interactive activities, although entertaining, act as disruptive elements which interfere with verbal memory and deep moral comprehension.

  5. Reading a Story: Different Degrees of Learning in Different Learning Environments.

    Science.gov (United States)

    Giannini, Anna Maria; Cordellieri, Pierluigi; Piccardi, Laura

    2017-01-01

    The learning environment in which material is acquired may produce differences in delayed recall and in the elements that individuals focus on. These differences may appear even during development. In the present study, we compared three different learning environments in 450 normally developing 7-year-old children subdivided into three groups according to the type of learning environment. Specifically, children were asked to learn the same material shown in three different learning environments: reading illustrated books (TB); interacting with the same text displayed on a PC monitor and enriched with interactive activities (PC-IA); reading the same text on a PC monitor but not enriched with interactive narratives (PC-NoIA). Our results demonstrated that TB and PC-NoIA elicited better verbal memory recall. In contrast, PC-IA and PC-NoIA produced higher scores for visuo-spatial memory, enhancing memory for spatial relations, positions and colors with respect to TB. Interestingly, only TB seemed to produce a deeper comprehension of the story's moral. Our results indicated that PC-IA offered a different type of learning that favored visual details. In this sense, interactive activities demonstrate certain limitations, probably due to information overabundance, emotional mobilization, emphasis on images and effort exerted in interactive activities. Thus, interactive activities, although entertaining, act as disruptive elements which interfere with verbal memory and deep moral comprehension.

  6. Learning to make collective decisions: the impact of confidence escalation.

    Science.gov (United States)

    Mahmoodi, Ali; Bang, Dan; Ahmadabadi, Majid Nili; Bahrami, Bahador

    2013-01-01

    Little is known about how people learn to take into account others' opinions in joint decisions. To address this question, we combined computational and empirical approaches. Human dyads made individual and joint visual perceptual decisions and rated their confidence in those decisions (data previously published). We trained a reinforcement (temporal difference) learning agent to receive the participants' confidence levels and learn to arrive at a dyadic decision by finding the policy that either maximized the accuracy of the model decisions or maximally conformed to the empirical dyadic decisions. When confidences were shared visually without verbal interaction, RL agents successfully captured social learning. When participants exchanged confidences visually and interacted verbally, no collective benefit was achieved and the model failed to predict the dyadic behaviour. Behaviourally, dyad members' confidence increased progressively and verbal interaction accelerated this escalation. The success of the model in drawing collective benefit from dyad members was inversely related to confidence escalation rate. The findings show that an automated learning agent can, in principle, combine individual opinions and achieve collective benefit, but the same agent cannot discount the escalation, suggesting that one cognitive component of collective decision making in humans may involve discounting of overconfidence arising from interactions.
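
    A hedged sketch of this kind of agent is given below: given the two members' confidences, it learns which member to follow so as to maximize dyadic accuracy. The state discretization, action coding, and parameters are illustrative assumptions, not the published model.

```python
# Toy agent that learns, from reward feedback, whose decision to follow
# given a coarse binning of each dyad member's confidence.
import random

alpha, epsilon = 0.1, 0.1
Q = {}   # key: (conf_bin_1, conf_bin_2, follow_member) -> value estimate

def bin_conf(c):
    return min(int(abs(c)), 3)          # coarse confidence bins 0..3

def decide(c1, c2):
    state = (bin_conf(c1), bin_conf(c2))
    if random.random() < epsilon:
        a = random.choice([1, 2])
    else:
        a = max([1, 2], key=lambda m: Q.get(state + (m,), 0.0))
    return a, state

def update(state, a, correct):
    key = state + (a,)
    r = 1.0 if correct else 0.0
    Q[key] = Q.get(key, 0.0) + alpha * (r - Q.get(key, 0.0))

a, s = decide(c1=2.5, c2=-1.0)          # member 1 is the more confident
update(s, a, correct=True)
```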

  7. Effects of partial reinforcement and time between reinforced trials on terminal response rate in pigeon autoshaping.

    Science.gov (United States)

    Gottlieb, Daniel A

    2006-03-01

    Partial reinforcement often leads to asymptotically higher rates of responding and number of trials with a response than does continuous reinforcement in pigeon autoshaping. However, comparisons typically involve a partial reinforcement schedule that differs from the continuous reinforcement schedule in both time between reinforced trials and probability of reinforcement. Two experiments examined the relative contributions of these two manipulations to asymptotic response rate. Results suggest that the greater responding previously seen with partial reinforcement is primarily due to differential probability of reinforcement and not differential time between reinforced trials. Further, once established, differences in responding are resistant to a change in stimulus and contingency. Secondary response theories of autoshaped responding (theories that posit additional response-augmenting or response-attenuating mechanisms specific to partial or continuous reinforcement) cannot fully accommodate the current body of data. It is suggested that researchers who study pigeon autoshaping train animals on a common task prior to training them under different conditions.

  8. Reinforcing Saccadic Amplitude Variability

    Science.gov (United States)

    Paeye, Celine; Madelain, Laurent

    2011-01-01

    Saccadic endpoint variability is often viewed as the outcome of neural noise occurring during sensorimotor processing. However, part of this variability might result from operant learning. We tested this hypothesis by reinforcing dispersions of saccadic amplitude distributions, while maintaining constant their medians. In a first experiment we…

  9. Use of frontal lobe hemodynamics as reinforcement signals to an adaptive controller.

    Directory of Open Access Journals (Sweden)

    Marcello M DiStasio

    Full Text Available Decision-making ability in the frontal lobe (among other brain structures) relies on the assignment of value to states of the animal and its environment. Then higher valued states can be pursued and lower (or negative) valued states avoided. The same principle forms the basis for computational reinforcement learning controllers, which have been fruitfully applied both as models of value estimation in the brain, and as artificial controllers in their own right. This work shows how state desirability signals decoded from frontal lobe hemodynamics, as measured with near-infrared spectroscopy (NIRS), can be applied as reinforcers to an adaptable artificial learning agent in order to guide its acquisition of skills. A set of experiments carried out on an alert macaque demonstrate that both oxy- and deoxyhemoglobin concentrations in the frontal lobe show differences in response to both primarily and secondarily desirable (versus undesirable) stimuli. This difference allows a NIRS signal classifier to serve successfully as a reinforcer for an adaptive controller performing a virtual tool-retrieval task. The agent's adaptability allows its performance to exceed the limits of the NIRS classifier decoding accuracy. We also show that decoding state desirabilities is more accurate when using relative concentrations of both oxyhemoglobin and deoxyhemoglobin, rather than either species alone.

  10. Concurrent schedules of wheel-running reinforcement: choice between different durations of opportunity to run in rats.

    Science.gov (United States)

    Belke, Terry W

    2006-02-01

    How do animals choose between opportunities to run of different durations? Are longer durations preferred over shorter durations because they permit a greater number of revolutions? Are shorter durations preferred because they engender higher rates of running? Will longer durations be chosen because running is less constrained? The present study reports on three experiments that attempted to address these questions. In the first experiment, five male Wistar rats chose between 10-sec and 50-sec opportunities to run on modified concurrent variable-interval (VI) schedules. Across conditions, the durations associated with the alternatives were reversed. Response, time, and reinforcer proportions did not vary from indifference. In a second experiment, eight female Long-Evans rats chose between opportunities to run of equal (30 sec) and unequal durations (10 sec and 50 sec) on concurrent variable-ratio (VR) schedules. As in Experiment 1, between presentations of equal duration conditions, 10-sec and 50-sec durations were reversed. Results showed that response, time, and reinforcer proportions on an alternative did not vary with reinforcer duration. In a third experiment, using concurrent VR schedules, durations were systematically varied to decrease the shorter duration toward 0 sec. As the shorter duration decreased, response, time, and reinforcer proportions shifted toward the longer duration. In summary, differences in durations of opportunities to run did not affect choice behavior in a manner consistent with the assumption that a longer reinforcer is a larger reinforcer.

  11. Safe robot execution in model-based reinforcement learning

    OpenAIRE

    Martínez Martínez, David; Alenyà Ribas, Guillem; Torras, Carme

    2015-01-01

    Task learning in robotics requires repeatedly executing the same actions in different states to learn the model of the task. However, in real-world domains, there are usually sequences of actions that, if executed, may produce unrecoverable errors (e.g. breaking an object). Robots should avoid repeating such errors when learning, and thus explore the state space in a more intelligent way. This requires identifying dangerous action effects to avoid including such actions in the generated plans...

  12. Predicting Pilot Behavior in Medium Scale Scenarios Using Game Theory and Reinforcement Learning

    Science.gov (United States)

    Yildiz, Yildiray; Agogino, Adrian; Brat, Guillaume

    2013-01-01

    Effective automation is critical in achieving the capacity and safety goals of the Next Generation Air Traffic System. Unfortunately, creating integration and validation tools for such automation is difficult as the interactions between automation and their human counterparts are complex and unpredictable. This validation becomes even more difficult as we integrate wide-reaching technologies that affect the behavior of different decision makers in the system such as pilots, controllers and airlines. While overt short-term behavior changes can be explicitly modeled with traditional agent modeling systems, subtle behavior changes caused by the integration of new technologies may snowball into larger problems and be very hard to detect. To overcome these obstacles, we show how integration of new technologies can be validated by learning behavior models based on goals. In this framework, human participants are not modeled explicitly. Instead, their goals are modeled and through reinforcement learning their actions are predicted. The main advantage to this approach is that modeling is done within the context of the entire system allowing for accurate modeling of all participants as they interact as a whole. In addition such an approach allows for efficient trade studies and feasibility testing on a wide range of automation scenarios. The goal of this paper is to test that such an approach is feasible. To do this we implement this approach using a simple discrete-state learning system on a scenario where 50 aircraft need to self-navigate using Automatic Dependent Surveillance-Broadcast (ADS-B) information. In this scenario, we show how the approach can be used to predict the ability of pilots to adequately balance aircraft separation and fly efficient paths. We present results with several levels of complexity and airspace congestion.

  13. Optimal and Scalable Caching for 5G Using Reinforcement Learning of Space-Time Popularities

    Science.gov (United States)

    Sadeghi, Alireza; Sheikholeslami, Fatemeh; Giannakis, Georgios B.

    2018-02-01

    Small basestations (SBs) equipped with caching units have potential to handle the unprecedented demand growth in heterogeneous networks. Through low-rate, backhaul connections with the backbone, SBs can prefetch popular files during off-peak traffic hours, and service them to the edge at peak periods. To intelligently prefetch, each SB must learn what and when to cache, while taking into account SB memory limitations, the massive number of available contents, the unknown popularity profiles, as well as the space-time popularity dynamics of user file requests. In this work, local and global Markov processes model user requests, and a reinforcement learning (RL) framework is put forth for finding the optimal caching policy when the transition probabilities involved are unknown. Joint consideration of global and local popularity demands along with cache-refreshing costs allow for a simple, yet practical asynchronous caching approach. The novel RL-based caching relies on a Q-learning algorithm to implement the optimal policy in an online fashion, thus enabling the cache control unit at the SB to learn, track, and possibly adapt to the underlying dynamics. To endow the algorithm with scalability, a linear function approximation of the proposed Q-learning scheme is introduced, offering faster convergence as well as reduced complexity and memory requirements. Numerical tests corroborate the merits of the proposed approach in various realistic settings.
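
    To make the scalability point concrete, the sketch below shows semi-gradient Q-learning with a linear function approximator of the kind referred to above; the feature construction, reward, and parameters are placeholders, not the paper's exact caching scheme.

```python
# Linear-function-approximation Q-learning sketch for a caching-style decision:
# each candidate action (e.g., which file set to prefetch) is described by a
# feature vector summarizing local and global popularity.
import numpy as np

n_features, alpha, gamma, epsilon = 8, 0.05, 0.95, 0.1
w = np.zeros(n_features)

def q_value(features):
    return float(w @ features)

def choose(candidate_features):
    # epsilon-greedy over the candidate caching actions
    if np.random.rand() < epsilon:
        return np.random.randint(len(candidate_features))
    return int(np.argmax([q_value(f) for f in candidate_features]))

def update(features, reward, next_candidate_features):
    # semi-gradient Q-learning: bootstrap on the best-valued next action
    global w
    best_next = max(q_value(f) for f in next_candidate_features)
    td_error = reward + gamma * best_next - q_value(features)
    w = w + alpha * td_error * features

rng = np.random.default_rng(0)
candidates = [rng.random(n_features) for _ in range(3)]
a = choose(candidates)
next_candidates = [rng.random(n_features) for _ in range(3)]
update(candidates[a], reward=1.0, next_candidate_features=next_candidates)
```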

  14. Pre-learning stress differentially affects long-term memory for emotional words, depending on temporal proximity to the learning experience.

    Science.gov (United States)

    Zoladz, Phillip R; Clark, Brianne; Warnecke, Ashlee; Smith, Lindsay; Tabar, Jennifer; Talbot, Jeffery N

    2011-07-06

    Stress exerts a profound, yet complex, influence on learning and memory and can enhance, impair or have no effect on these processes. Here, we have examined how the administration of stress at different times before learning affects long-term (24-hr) memory for neutral and emotional information. Participants submerged their dominant hand into a bath of ice cold water (Stress) or into a bath of warm water (No stress) for 3 min. Either immediately (Exp. 1) or 30 min (Exp. 2) after the water bath manipulation, participants were presented with a list of 30 words varying in emotional valence. The next day, participants' memory for the word list was assessed via free recall and recognition tests. In both experiments, stressed participants exhibited greater blood pressure, salivary cortisol levels, and subjective pain and stress ratings than non-stressed participants in response to the water bath manipulation. Stress applied immediately prior to learning (Exp. 1) enhanced the recognition of positive words, while stress applied 30 min prior to learning (Exp. 2) impaired free recall of negative words. Participants' recognition of positive words in Experiment 1 was positively associated with their heart rate responses to the water bath manipulation, while participants' free recall of negative words in Experiment 2 was negatively associated with their blood pressure and cortisol responses to the water bath manipulation. These findings indicate that the differential effects of pre-learning stress on long-term memory may depend on the temporal proximity of the stressor to the learning experience and the emotional nature of the to-be-learned information. Copyright © 2011. Published by Elsevier Inc.

  15. Dynamic pricing and automated resource allocation for complex information services reinforcement learning and combinatorial auctions

    CERN Document Server

    Schwind, Michael; Fandel, G

    2007-01-01

    Many firms provide their customers with online information products which require limited resources such as server capacity. This book develops allocation mechanisms that aim to ensure an efficient resource allocation in modern IT-services. Recent methods of artificial intelligence, such as neural networks and reinforcement learning, and nature-oriented optimization methods, such as genetic algorithms and simulated annealing, are advanced and applied to allocation processes in distributed IT-infrastructures, e.g. grid systems. The author presents two methods, both of which use the users' w

  16. Humanoids Learning to Walk: A Natural CPG-Actor-Critic Architecture.

    Science.gov (United States)

    Li, Cai; Lowe, Robert; Ziemke, Tom

    2013-01-01

    The identification of learning mechanisms for locomotion has been the subject of much research for some time but many challenges remain. Dynamic systems theory (DST) offers a novel approach to humanoid learning through environmental interaction. Reinforcement learning (RL) has offered a promising method to adaptively link the dynamic system to the environment it interacts with via a reward-based value system. In this paper, we propose a model that integrates the above perspectives and applies it to the case of a humanoid (NAO) robot learning to walk, an ability which emerges from its value-based interaction with the environment. In the model, a simplified central pattern generator (CPG) architecture inspired by neuroscientific research and DST is integrated with an actor-critic approach to RL (cpg-actor-critic). In the cpg-actor-critic architecture, least-square-temporal-difference based learning converges to the optimal solution quickly by using natural gradient learning and balancing exploration and exploitation. Furthermore, rather than using a traditional (designer-specified) reward, it uses a dynamic value function as a stability indicator that adapts to the environment. The results obtained are analyzed using a novel DST-based embodied cognition approach. Learning to walk, from this perspective, is a process of integrating levels of sensorimotor activity and value.
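
    The least-square-temporal-difference component mentioned above can be sketched compactly as follows; this is a generic LSTD(0) value-function estimator with toy features and transitions, assumed for illustration rather than the cpg-actor-critic implementation itself.

```python
# Compact LSTD(0) sketch: solve for critic weights from sampled transitions.
import numpy as np

def lstd(transitions, n_features, gamma=0.95, reg=1e-3):
    """transitions: iterable of (phi_s, reward, phi_s_next) feature tuples."""
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        A += np.outer(phi, phi - gamma * phi_next)
        b += r * phi
    return np.linalg.solve(A, b)   # critic weights: V(s) ~ w @ phi(s)

rng = np.random.default_rng(0)
data = [(rng.random(4), rng.random(), rng.random(4)) for _ in range(200)]
w = lstd(data, n_features=4)
```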

  17. Observing Responses and Serial Stimuli: Searching for the Reinforcing Properties of the S-

    Science.gov (United States)

    Escobar, Rogelio; Bruner, Carlos A.

    2009-01-01

    The control exerted by a stimulus associated with an extinction component (S-) on observing responses was determined as a function of its temporal relation with the onset of the reinforcement component (S+). Lever pressing by rats was reinforced on a mixed random-interval extinction schedule. Each press on a second lever produced stimuli…

  18. Cognitive Risk Factors for Specific Learning Disorder: Processing Speed, Temporal Processing, and Working Memory.

    Science.gov (United States)

    Moll, Kristina; Göbel, Silke M; Gooch, Debbie; Landerl, Karin; Snowling, Margaret J

    2016-01-01

    High comorbidity rates between reading disorder (RD) and mathematics disorder (MD) indicate that, although the cognitive core deficits underlying these disorders are distinct, additional domain-general risk factors might be shared between the disorders. Three domain-general cognitive abilities were investigated in children with RD and MD: processing speed, temporal processing, and working memory. Since attention problems frequently co-occur with learning disorders, the study examined whether these three factors, which are known to be associated with attention problems, account for the comorbidity between these disorders. The sample comprised 99 primary school children in four groups: children with RD, children with MD, children with both disorders (RD+MD), and typically developing children (TD controls). Measures of processing speed, temporal processing, and memory were analyzed in a series of ANCOVAs including attention ratings as covariate. All three risk factors were associated with poor attention. After controlling for attention, associations with RD and MD differed: Although deficits in verbal memory were associated with both RD and MD, reduced processing speed was related to RD, but not MD; and the association with RD was restricted to processing speed for familiar nameable symbols. In contrast, impairments in temporal processing and visuospatial memory were associated with MD, but not RD. © Hammill Institute on Disabilities 2014.

  19. Temporal discounting and heart rate reactivity to stress.

    Science.gov (United States)

    Diller, James W; Patros, Connor H G; Prentice, Paula R

    2011-07-01

    Temporal discounting is the reduction of the value of a reinforcer as a function of increasing delay to its presentation. Impulsive individuals discount delayed consequences more rapidly than self-controlled individuals, and impulsivity has been related to substance abuse, gambling, and other problem behaviors. A growing body of literature has identified biological correlates of impulsivity, though little research to date has examined relations between delay discounting and markers of poor health (e.g., cardiovascular reactivity to stress). We evaluated the relation between one aspect of impulsivity, measured using a computerized temporal discounting task, and heart rate reactivity, measured as a change in heart rate from rest during a serial subtraction task. A linear regression showed that individuals who were more reactive to stress responded more impulsively (i.e., discounted delayed reinforcers more rapidly). When results were stratified by gender, the effect was observed for females, but not for males. This finding supports previous research on gender differences in cardiovascular reactivity and suggests that this type of reactivity may be an important correlate of impulsive behavior. Copyright © 2011 Elsevier B.V. All rights reserved.
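
    The standard hyperbolic form of temporal discounting, V = A / (1 + kD), makes the link between the discounting parameter k and impulsivity concrete: a larger k means the delayed reward loses value more steeply. The sketch below uses purely illustrative k values.

```python
# Hyperbolic discounting: subjective value of a delayed reward, V = A / (1 + k*D).
def discounted_value(amount, delay, k):
    return amount / (1.0 + k * delay)

impulsive_k, self_controlled_k = 0.25, 0.01   # illustrative discount rates
for delay_days in (0, 7, 30, 180):
    v_imp = discounted_value(100.0, delay_days, impulsive_k)
    v_sc = discounted_value(100.0, delay_days, self_controlled_k)
    print(f"delay={delay_days:>3} d  impulsive={v_imp:6.1f}  self-controlled={v_sc:6.1f}")
```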

  20. Translingual Literacy, Language Difference, and Matters of Agency

    Science.gov (United States)

    Lu, Min-Zhan; Horner, Bruce

    2013-01-01

    We argue that composition scholarship's defenses of language differences in student writing reinforce dominant ideology's spatial framework conceiving language difference as deviation from a norm of sameness. We argue instead for adopting a temporal-spatial framework defining difference as the norm of utterances, and defining languages,…

  1. Different brain circuits underlie motor and perceptual representations of temporal intervals

    DEFF Research Database (Denmark)

    Bueti, Doemnica; Walsh, Vincent; Frith, Christopher

    2008-01-01

    In everyday life, temporal information is used for both perception and action, but whether these two functions reflect the operation of similar or different neural circuits is unclear. We used functional magnetic resonance imaging to investigate the neural correlates of processing temporal information when either a motor or a perceptual representation is used. Participants viewed two identical sequences of visual stimuli and used the information differently to perform either a temporal reproduction or a temporal estimation task. By comparing brain activity evoked by these tasks and control… V5/MT. Our findings point to a role for the parietal cortex as an interface between sensory and motor processes and suggest that it may be a key node in the translation of temporal information into action. Furthermore, we discuss the potential importance of the extrastriate cortex in processing visual…

  2. Oxytocin attenuates trust as a subset of more general reinforcement learning, with altered reward circuit functional connectivity in males.

    Science.gov (United States)

    Ide, Jaime S; Nedic, Sanja; Wong, Kin F; Strey, Shmuel L; Lawson, Elizabeth A; Dickerson, Bradford C; Wald, Lawrence L; La Camera, Giancarlo; Mujica-Parodi, Lilianne R

    2018-07-01

    Oxytocin (OT) is an endogenous neuropeptide that, while originally thought to promote trust, has more recently been found to be context-dependent. Here we extend experimental paradigms previously restricted to de novo decision-to-trust, to a more realistic environment in which social relationships evolve in response to iterative feedback over twenty interactions. In a randomized, double blind, placebo-controlled within-subject/crossover experiment of human adult males, we investigated the effects of a single dose of intranasal OT (40 IU) on Bayesian expectation updating and reinforcement learning within a social context, with associated brain circuit dynamics. Subjects participated in a neuroeconomic task (Iterative Trust Game) designed to probe iterative social learning while their brains were scanned using ultra-high field (7T) fMRI. We modeled each subject's behavior using Bayesian updating of belief-states ("willingness to trust") as well as canonical measures of reinforcement learning (learning rate, inverse temperature). Behavioral trajectories were then used as regressors within fMRI activation and connectivity analyses to identify corresponding brain network functionality affected by OT. Behaviorally, OT reduced feedback learning, without bias with respect to positive versus negative reward. Neurobiologically, reduced learning under OT was associated with muted communication between three key nodes within the reward circuit: the orbitofrontal cortex, amygdala, and lateral (limbic) habenula. Our data suggest that OT, rather than inspiring feelings of generosity, instead attenuates the brain's encoding of prediction error and therefore its ability to modulate pre-existing beliefs. This effect may underlie OT's putative role in promoting what has typically been reported as 'unjustified trust' in the face of information that suggests likely betrayal, while also resolving apparent contradictions with regard to OT's context-dependent behavioral effects. Copyright
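
    The canonical reinforcement-learning quantities named above (learning rate and inverse temperature) can be illustrated with a generic Rescorla-Wagner value update and a softmax choice rule. This is a hedged sketch of that model class applied to a repeated trust decision, not the authors' fitted model; names and parameter values are assumptions.

```python
# Generic Rescorla-Wagner + softmax model of a repeated trust decision.
import math

def softmax_trust_prob(v_trust, v_keep, beta):
    # inverse temperature beta controls how deterministically values drive choice
    return 1.0 / (1.0 + math.exp(-beta * (v_trust - v_keep)))

def rw_update(value, reward, alpha):
    # learning rate alpha scales the prediction error
    return value + alpha * (reward - value)

v_trust, v_keep = 0.0, 0.0
alpha, beta = 0.3, 2.0
for reward in [1.0, 1.0, -1.0, 1.0, -1.0]:   # outcomes of successive trust choices
    p_trust = softmax_trust_prob(v_trust, v_keep, beta)
    v_trust = rw_update(v_trust, reward, alpha)
```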

  3. Punishment and psychopathy: a case-control functional MRI investigation of reinforcement learning in violent antisocial personality disordered men.

    Science.gov (United States)

    Gregory, Sarah; Blair, R James; Ffytche, Dominic; Simmons, Andrew; Kumari, Veena; Hodgins, Sheilagh; Blackwood, Nigel

    2015-02-01

    Men with antisocial personality disorder show lifelong abnormalities in adaptive decision making guided by the weighing up of reward and punishment information. Among men with antisocial personality disorder, modification of the behaviour of those with additional diagnoses of psychopathy seems particularly resistant to punishment. We did a case-control functional MRI (fMRI) study in 50 men, of whom 12 were violent offenders with antisocial personality disorder and psychopathy, 20 were violent offenders with antisocial personality disorder but not psychopathy, and 18 were healthy non-offenders. We used fMRI to measure brain activation associated with the representation of punishment or reward information during an event-related probabilistic response-reversal task, assessed with standard general linear-model-based analysis. Offenders with antisocial personality disorder and psychopathy displayed discrete regions of increased activation in the posterior cingulate cortex and anterior insula in response to punished errors during the task reversal phase, and decreased activation to all correct rewarded responses in the superior temporal cortex. This finding was in contrast to results for offenders without psychopathy and healthy non-offenders. Punishment prediction error signalling in offenders with antisocial personality disorder and psychopathy was highly atypical. This finding challenges the widely held view that such men are simply characterised by diminished neural sensitivity to punishment. Instead, this finding indicates altered organisation of the information-processing system responsible for reinforcement learning and appropriate decision making. This difference between violent offenders with antisocial personality disorder with and without psychopathy has implications for the causes of these disorders and for treatment approaches. National Forensic Mental Health Research and Development Programme, UK Ministry of Justice, Psychiatry Research Trust, NIHR

  4. Reaching control of a full-torso, modelled musculoskeletal robot using muscle synergies emergent under reinforcement learning

    International Nuclear Information System (INIS)

    Diamond, A; Holland, O E

    2014-01-01

    ‘Anthropomimetic’ robots mimic both human morphology and internal structure—skeleton, muscles, compliance and high redundancy—thus presenting a formidable challenge to conventional control. Here we derive a novel controller for this class of robot which learns effective reaching actions through the sustained activation of weighted muscle synergies, an approach which draws upon compelling, recent evidence from animal and human studies, but is almost unexplored to date in the musculoskeletal robot literature. Since the effective synergy patterns for a given robot will be unknown, we derive a reinforcement-learning approach intended to allow their emergence, in particular those patterns aiding linearization of control. Using an extensive physics-based model of the anthropomimetic ECCERobot, we find that effective reaching actions can be learned comprising only two sequential motor co-activation patterns, each controlled by just a single common driving signal. Factor analysis shows the emergent muscle co-activations can be largely reconstructed using weighted combinations of only 13 common fragments. Testing these ‘candidate’ synergies as drivable units, the same controller now learns the reaching task both faster and better. (paper)

  5. Fluoxetine Restores Spatial Learning but Not Accelerated Forgetting in Mesial Temporal Lobe Epilepsy

    Science.gov (United States)

    Barkas, Lisa; Redhead, Edward; Taylor, Matthew; Shtaya, Anan; Hamilton, Derek A.; Gray, William P.

    2012-01-01

    Learning and memory dysfunction is the most common neuropsychological effect of mesial temporal lobe epilepsy, and because the underlying neurobiology is poorly understood, there are no pharmacological strategies to help restore memory function in these patients. We have demonstrated impairments in the acquisition of an allocentric spatial task,…

  6. Gender and hemispheric differences in temporal lobe epilepsy: a VBM study.

    Science.gov (United States)

    Santana, Maria Teresa Castilho Garcia; Jackowski, Andrea Parolin; Britto, Fernanda Dos Santos; Sandim, Gabriel Barbosa; Caboclo, Luís Otávio Sales Ferreira; Centeno, Ricardo Silva; Carrete, Henrique; Yacubian, Elza Márcia Targas

    2014-04-01

    Gender differences are recognized in the functional and anatomical organization of the human brain. Differences between genders are probably expressed early in life, when differential rates of cerebral maturation occur. Sexual dimorphism has been described in temporal lobe epilepsy with mesial temporal sclerosis (TLE-MTS). Several voxel-based morphometry (VBM) studies have shown that TLE-MTS extends beyond mesial temporal structures, and that there are differences in the extent of anatomical damage between hemispheres, although none have approached gender differences. Our aim was to investigate gender differences and anatomical abnormalities in TLE-MTS. VBM5 was employed to analyze gender and hemispheric differences in 120 patients with TLE-MTS and 50 controls. VBM abnormalities were more widespread in left-TLE; while in women changes were mostly seen in temporal areas, frontal regions were more affected in men. Our study confirmed that gender and laterality are important factors determining the nature and severity of brain damage in TLE-MTS. Differential rates of maturation between gender and hemispheres may explain the distinct areas of anatomical damage in men and women. Copyright © 2013 British Epilepsy Association. Published by Elsevier Ltd. All rights reserved.

  7. Humanoids Learning to Walk: a Natural CPG-Actor-Critic Architecture

    Directory of Open Access Journals (Sweden)

    Cai Li

    2013-04-01

    Full Text Available The identification of learning mechanisms for locomotion has been the subject of much research for some time, but many challenges remain. Dynamic systems theory (DST) offers a novel approach to humanoid learning through environmental interaction. Reinforcement learning (RL) has offered a promising method to adaptively link the dynamic system to the environment it interacts with via a reward-based value system. In this paper, we propose a model that integrates the above perspectives and applies it to the case of a humanoid (NAO) robot learning to walk, an ability which emerges from its value-based interaction with the environment. In the model, a simplified central pattern generator (CPG) architecture inspired by neuroscientific research and DST is integrated with an actor-critic approach to RL (CPG-actor-critic). In the CPG-actor-critic architecture, least-squares temporal-difference (LSTD) based learning converges to the optimal solution quickly by using the natural gradient and balancing exploration and exploitation. Furthermore, rather than using a traditional (designer-specified) reward, it uses a dynamic value function as a stability indicator (SI) that adapts to the environment. The results obtained are analyzed and explained using a novel DST embodied cognition approach. Learning to walk, from this perspective, is a process of integrating sensorimotor levels and value.
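
    The record above couples a CPG with an LSTD-based natural actor-critic. For readers who want the bare skeleton of the actor-critic idea, the sketch below runs a tabular TD(0) critic with a softmax policy-gradient actor on a toy chain task; it is an illustration under invented assumptions, not the paper's CPG architecture, LSTD critic or natural-gradient update.

    ```python
    import numpy as np

    # Minimal actor-critic sketch: TD(0) critic + softmax policy-gradient actor
    # on a toy chain task (illustrative only).
    n_states, n_actions = 5, 2
    gamma, alpha_v, alpha_pi = 0.95, 0.1, 0.05

    V = np.zeros(n_states)                      # critic: state-value estimates
    theta = np.zeros((n_states, n_actions))     # actor: policy preferences

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    def step(s, a):
        # Toy dynamics: action 1 moves right, action 0 moves left; reward at the end.
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s_next, 1.0 if s_next == n_states - 1 else 0.0

    rng = np.random.default_rng(0)
    for episode in range(500):
        s = 0
        for t in range(50):
            p = policy(s)
            a = rng.choice(n_actions, p=p)
            s_next, r = step(s, a)
            td_error = r + gamma * V[s_next] - V[s]     # temporal-difference error
            V[s] += alpha_v * td_error                  # critic update
            grad_log = -p
            grad_log[a] += 1.0                          # gradient of log softmax policy
            theta[s] += alpha_pi * td_error * grad_log  # actor update
            s = s_next
            if r > 0:
                break

    print(np.round(V, 2))
    ```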

  8. Use of soil moisture dynamics and patterns at different spatio-temporal scales for the investigation of subsurface flow processes

    Directory of Open Access Journals (Sweden)

    T. Blume

    2009-07-01

    Full Text Available Spatial patterns as well as temporal dynamics of soil moisture have a major influence on runoff generation. The investigation of these dynamics and patterns can thus yield valuable information on hydrological processes, especially in data scarce or previously ungauged catchments. The combination of spatially scarce but temporally high resolution soil moisture profiles with episodic and thus temporally scarce moisture profiles at additional locations provides information on spatial as well as temporal patterns of soil moisture at the hillslope transect scale. This approach is better suited to difficult terrain (dense forest, steep slopes) than geophysical techniques and at the same time less cost-intensive than a high resolution grid of continuously measuring sensors. Rainfall simulation experiments with dye tracers while continuously monitoring soil moisture response allow for visualization of flow processes in the unsaturated zone at these locations. Data were analyzed at different spatio-temporal scales using various graphical methods, such as space-time colour maps (for the event and plot scale) and binary indicator maps (for the long-term and hillslope scale). Annual dynamics of soil moisture and decimeter-scale variability were also investigated. The proposed approach proved to be successful in the investigation of flow processes in the unsaturated zone and showed the importance of preferential flow in the Malalcahuello Catchment, a data-scarce catchment in the Andes of Southern Chile. Fast response times of stream flow indicate that preferential flow observed at the plot scale might also be of importance at the hillslope or catchment scale. Flow patterns were highly variable in space but persistent in time. The most likely explanation for preferential flow in this catchment is a combination of hydrophobicity, small scale heterogeneity in rainfall due to redistribution in the canopy and strong gradients in unsaturated conductivities leading to

  9. New learning of music after bilateral medial temporal lobe damage: evidence from an amnesic patient.

    Science.gov (United States)

    Valtonen, Jussi; Gregory, Emma; Landau, Barbara; McCloskey, Michael

    2014-01-01

    Damage to the hippocampus impairs the ability to acquire new declarative memories, but not the ability to learn simple motor tasks. An unresolved question is whether hippocampal damage affects learning for music performance, which requires motor processes, but in a cognitively complex context. We studied learning of novel musical pieces by sight-reading in a newly identified amnesic, LSJ, who was a skilled amateur violist prior to contracting herpes simplex encephalitis. LSJ has suffered virtually complete destruction of the hippocampus bilaterally, as well as extensive damage to other medial temporal lobe structures and the left anterior temporal lobe. Because of LSJ's rare combination of musical training and near-complete hippocampal destruction, her case provides a unique opportunity to investigate the role of the hippocampus for complex motor learning processes specifically related to music performance. Three novel pieces of viola music were composed and closely matched for factors contributing to a piece's musical complexity. LSJ practiced playing two of the pieces, one in each of the two sessions during the same day. Relative to a third unpracticed control piece, LSJ showed significant pre- to post-training improvement for the two practiced pieces. Learning effects were observed both with detailed analyses of correctly played notes, and with subjective whole-piece performance evaluations by string instrument players. The learning effects were evident immediately after practice and 14 days later. The observed learning stands in sharp contrast to LSJ's complete lack of awareness that the same pieces were being presented repeatedly, and to the profound impairments she exhibits in other learning tasks. Although learning in simple motor tasks has been previously observed in amnesic patients, our results demonstrate that non-hippocampal structures can support complex learning of novel musical sequences for music performance.

  10. New Learning of Music after Bilateral Medial Temporal Lobe Damage: Evidence from an Amnesic Patient

    Science.gov (United States)

    Valtonen, Jussi; Gregory, Emma; Landau, Barbara; McCloskey, Michael

    2014-01-01

    Damage to the hippocampus impairs the ability to acquire new declarative memories, but not the ability to learn simple motor tasks. An unresolved question is whether hippocampal damage affects learning for music performance, which requires motor processes, but in a cognitively complex context. We studied learning of novel musical pieces by sight-reading in a newly identified amnesic, LSJ, who was a skilled amateur violist prior to contracting herpes simplex encephalitis. LSJ has suffered virtually complete destruction of the hippocampus bilaterally, as well as extensive damage to other medial temporal lobe structures and the left anterior temporal lobe. Because of LSJ’s rare combination of musical training and near-complete hippocampal destruction, her case provides a unique opportunity to investigate the role of the hippocampus for complex motor learning processes specifically related to music performance. Three novel pieces of viola music were composed and closely matched for factors contributing to a piece’s musical complexity. LSJ practiced playing two of the pieces, one in each of the two sessions during the same day. Relative to a third unpracticed control piece, LSJ showed significant pre- to post-training improvement for the two practiced pieces. Learning effects were observed both with detailed analyses of correctly played notes, and with subjective whole-piece performance evaluations by string instrument players. The learning effects were evident immediately after practice and 14 days later. The observed learning stands in sharp contrast to LSJ’s complete lack of awareness that the same pieces were being presented repeatedly, and to the profound impairments she exhibits in other learning tasks. Although learning in simple motor tasks has been previously observed in amnesic patients, our results demonstrate that non-hippocampal structures can support complex learning of novel musical sequences for music performance. PMID:25232312

  11. New Learning of Music after Bilateral Medial Temporal Lobe Damage: Evidence from an Amnesic Patient

    Directory of Open Access Journals (Sweden)

    Jussi Valtonen

    2014-09-01

    Full Text Available Damage to the hippocampus impairs the ability to acquire new declarative memories, but not the ability to learn simple motor tasks. An unresolved question is whether hippocampal damage affects learning for music performance, which requires motor processes, but in a cognitively complex context. We studied learning of novel musical pieces by sight-reading in a newly-identified amnesic, LSJ, who was a skilled amateur violist prior to contracting herpes simplex encephalitis. LSJ has suffered virtually complete destruction of the hippocampus bilaterally, as well as extensive damage to other medial temporal lobe structures and the left anterior temporal lobe. Because of LSJ’s rare combination of musical training and near-complete hippocampal destruction, her case provides a unique opportunity to investigate the role of the hippocampus for complex motor learning processes specifically related to music performance. Three novel pieces of viola music were composed, closely matched for factors contributing to a piece’s musical complexity. LSJ practiced playing two of the pieces, one in each of two sessions during the same day. Relative to a third unpracticed control piece, LSJ showed significant pre- to post-training improvement for the two practiced pieces. Learning effects were observed both with detailed analyses of correctly played notes, and with subjective whole-piece performance evaluations by string instrument players. The learning effects were evident immediately after practice and 14 days later. The observed learning stands in sharp contrast to LSJ’s complete lack of awareness that the same pieces were being presented repeatedly, and to the profound impairments she exhibits in other learning tasks. Although learning in simple motor tasks has been previously observed in amnesic patients, our results demonstrate that non-hippocampal structures can support complex learning of novel musical sequences for music performance.

  12. Deletion of the δ opioid receptor gene impairs place conditioning but preserves morphine reinforcement.

    Science.gov (United States)

    Le Merrer, Julie; Plaza-Zabala, Ainhoa; Del Boca, Carolina; Matifas, Audrey; Maldonado, Rafael; Kieffer, Brigitte L

    2011-04-01

    Converging experimental data indicate that δ opioid receptors contribute to mediate drug reinforcement processes. Whether their contribution reflects a role in the modulation of drug reward or an implication in conditioned learning, however, has not been explored. In the present study, we investigated the impact of δ receptor gene knockout on reinforced conditioned learning under several experimental paradigms. We assessed the ability of δ receptor knockout mice to form drug-context associations with either morphine (appetitive)- or lithium (aversive)-induced Pavlovian place conditioning. We also examined the efficiency of morphine to serve as a positive reinforcer in these mice and their motivation to gain drug injections, with operant intravenous self-administration under fixed and progressive ratio schedules and at two different doses. Mutant mice showed impaired place conditioning in both appetitive and aversive conditions, indicating disrupted context-drug association. In contrast, mutant animals displayed intact acquisition of morphine self-administration and reached breaking-points comparable to control subjects. Thus, reinforcing effects of morphine and motivation to obtain the drug were maintained. Collectively, the data suggest that δ receptor activity is not involved in morphine reinforcement but facilitates place conditioning. This study reveals a novel aspect of δ opioid receptor function in addiction-related behaviors. Copyright © 2011 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  13. Fatigue Behavior of Steel Fiber Reinforced High-Strength Concrete under Different Stress Levels

    Science.gov (United States)

    Zhang, Chong; Gao, Danying; Gu, Zhiqiang

    2017-12-01

    The investigation was conducted to study the fatigue behavior of steel fiber reinforced high-strength concrete (SFRHSC) beams. A series of 5 SFRHSC beams was subjected to flexural fatigue tests at different stress levels S of 0.5, 0.55, 0.6, 0.7 and 0.8, respectively. A static test was conducted to determine the ultimate static capacity prior to the fatigue tests. Fatigue modes and S-N curves were analyzed. In addition, two fatigue life prediction models were analyzed and compared. It was found that the stress level S significantly influenced the fatigue life of the SFRHSC beams and that the fatigue behavior of the SFRHSC beams was mainly determined by the tensile reinforcement.
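
    As a rough illustration of how an S-N relation of the kind analyzed above can be fitted, the sketch below performs a least-squares fit of a Basquin-type log-linear curve; the stress-level and cycle-count pairs are placeholders invented for the example, not the test results reported in the study.

    ```python
    import numpy as np

    # Least-squares fit of a Basquin-type S-N relation, S = a - b * log10(N).
    # The (stress level, cycles to failure) pairs are illustrative placeholders.
    S = np.array([0.8, 0.7, 0.6, 0.55, 0.5])
    N = np.array([2e4, 1e5, 6e5, 1.5e6, 2e6])

    A = np.vstack([np.ones_like(S), np.log10(N)]).T
    coef, *_ = np.linalg.lstsq(A, S, rcond=None)
    a, b = coef[0], -coef[1]
    print(f"S = {a:.3f} - {b:.3f} * log10(N)")
    ```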

  14. Models, Entropy and Information of Temporal Social Networks

    Science.gov (United States)

    Zhao, Kun; Karsai, Márton; Bianconi, Ginestra

    Temporal social networks are characterized by heterogeneous duration of contacts, which can either follow a power-law distribution, such as in face-to-face interactions, or a Weibull distribution, such as in mobile-phone communication. Here we model the dynamics of face-to-face interaction and mobile phone communication by a reinforcement dynamics, which explains the data observed in these different types of social interactions. We quantify the information encoded in the dynamics of these networks by the entropy of temporal networks. Finally, we show evidence that human dynamics is able to modulate the information present in social network dynamics when it follows circadian rhythms and when it is interfacing with a new technology such as the mobile-phone communication technology.

  15. Fuzzy self-learning control for magnetic servo system

    Science.gov (United States)

    Tarn, J. H.; Kuo, L. T.; Juang, K. Y.; Lin, C. E.

    1994-01-01

    It is known that an effective control system is the key condition for successful implementation of high-performance magnetic servo systems. Major issues in designing such control systems are nonlinearity; unmodeled dynamics, such as secondary effects of copper resistance, stray fields, and saturation; and disturbance rejection, since the load effect acts directly on the servo system without transmission elements. One typical approach to designing control systems under these conditions is a special type of nonlinear feedback called gain scheduling. It accommodates linear regulators whose parameters are changed as a function of operating conditions in a preprogrammed way. In this paper, an on-line learning fuzzy control strategy is proposed. To inherit the wealth of linear control design, the relations between linear feedback and fuzzy logic controllers have been established. The exercise of engineering axioms of linear control design is thus transformed into the tuning of appropriate fuzzy parameters. Furthermore, fuzzy logic control brings the domain of candidate control laws from linear into nonlinear, and brings new prospects into the design of the local controllers. In addition, a self-learning scheme is utilized to automatically tune the fuzzy rule base. It is based on a network learning infrastructure; statistical approximation to assign credit; an animal learning method to update the reinforcement map with a fast learning rate; and a temporal difference predictive scheme to optimize the control laws. Unlike supervised and statistical unsupervised learning schemes, the proposed method learns on-line from past experience and information from the process, and forms the rule base of an FLC system from randomly assigned initial control rules.
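
    Among the components listed above, the temporal difference predictive scheme can be summarized very compactly. The fragment below is a generic TD(0) prediction update for reference only; it is not the fuzzy rule-base machinery of the record, and the states, reward and step sizes are arbitrary.

    ```python
    import numpy as np

    # Generic TD(0) prediction update (illustrative; not the paper's fuzzy controller).
    def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
        """One temporal-difference update of the value estimate for state s."""
        delta = r + gamma * V[s_next] - V[s]    # prediction error
        V[s] += alpha * delta
        return V, delta

    V = np.zeros(3)
    # Hypothetical transition: state 0 to state 1 with reward 1.0.
    V, delta = td0_update(V, s=0, r=1.0, s_next=1)
    print(V, delta)
    ```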

  16. Neural correlates of temporal credit assignment in the parietal lobe.

    Directory of Open Access Journals (Sweden)

    Timothy M Gersch

    Full Text Available Empirical studies of decision making have typically assumed that value learning is governed by time, such that a reward prediction error arising at a specific time triggers temporally-discounted learning for all preceding actions. However, in natural behavior, goals must be acquired through multiple actions, and each action can have different significance for the final outcome. As is recognized in computational research, carrying out multi-step actions requires the use of credit assignment mechanisms that focus learning on specific steps, but little is known about the neural correlates of these mechanisms. To investigate this question we recorded neurons in the monkey lateral intraparietal area (LIP) during a serial decision task where two consecutive eye movement decisions led to a final reward. The underlying decision trees were structured such that the two decisions had different relationships with the final reward, and the optimal strategy was to learn based on the final reward at one of the steps (the "F" step) but ignore changes in this reward at the remaining step (the "I" step). In two distinct contexts, the F step was either the first or the second in the sequence, controlling for effects of temporal discounting. We show that LIP neurons had the strongest value learning and strongest post-decision responses during the transition after the F step regardless of the serial position of this step. Thus, the neurons encode correlates of temporal credit assignment mechanisms that allocate learning to specific steps independently of temporal discounting.

  17. A Reinforcement Learning Approach to Access Management in Wireless Cellular Networks

    Directory of Open Access Journals (Sweden)

    Jihun Moon

    2017-01-01

    Full Text Available In smart city applications, huge numbers of devices need to be connected in an autonomous manner. 3rd Generation Partnership Project (3GPP) specifies that Machine Type Communication (MTC) should be used to handle data transmission among a large number of devices. However, the data transmission rates are highly variable, and this brings about a congestion problem. To tackle this problem, the use of Access Class Barring (ACB) is recommended to restrict the number of access attempts allowed in data transmission by utilizing strategic parameters. In this paper, we model the problem of determining the strategic parameters with a reinforcement learning algorithm. In our model, the system evolves to minimize both the collision rate and the access delay. The experimental results show that our scheme improves system performance in terms of the access success rate, the failure rate, the collision rate, and the access delay.
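
    One hypothetical way to cast the ACB parameter choice as tabular Q-learning is sketched below: discretized congestion levels serve as states, candidate barring factors as actions, and the reward penalizes collisions and delay. The toy environment, the numbers and the reward weights are invented for illustration and are not taken from the paper or from 3GPP specifications.

    ```python
    import numpy as np

    # Hypothetical Q-learning sketch for choosing an ACB barring factor.
    rng = np.random.default_rng(1)

    congestion_levels = 5                         # discretized load (state)
    barring_factors = [0.1, 0.3, 0.5, 0.7, 0.9]   # candidate ACB parameters (actions)
    Q = np.zeros((congestion_levels, len(barring_factors)))
    alpha, gamma, eps = 0.1, 0.9, 0.1

    def simulate_access(load, p_pass):
        """Toy environment: returns (collisions, delay, next load level)."""
        attempts = load * p_pass
        collisions = max(0.0, attempts - 1.0) * 0.5
        delay = (1.0 - p_pass) * load * 0.1
        next_load = int(round(load - attempts + rng.normal(1.0, 0.5)))
        return collisions, delay, min(congestion_levels - 1, max(0, next_load))

    state = congestion_levels - 1
    for _ in range(2000):
        a = rng.integers(len(barring_factors)) if rng.random() < eps else int(Q[state].argmax())
        collisions, delay, next_state = simulate_access(state, barring_factors[a])
        reward = -(collisions + delay)            # minimize collision rate and access delay
        Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
        state = next_state

    print(np.round(Q, 2))
    ```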

  18. Evolution of learning strategies in temporally and spatially variable environments: a review of theory.

    Science.gov (United States)

    Aoki, Kenichi; Feldman, Marcus W

    2014-02-01

    The theoretical literature from 1985 to the present on the evolution of learning strategies in variable environments is reviewed, with the focus on deterministic dynamical models that are amenable to local stability analysis, and on deterministic models yielding evolutionarily stable strategies. Individual learning, unbiased and biased social learning, mixed learning, and learning schedules are considered. A rapidly changing environment or frequent migration in a spatially heterogeneous environment favors individual learning over unbiased social learning. However, results are not so straightforward in the context of learning schedules or when biases in social learning are introduced. The three major methods of modeling temporal environmental change--coevolutionary, two-timescale, and information decay--are compared and shown to sometimes yield contradictory results. The so-called Rogers' paradox is inherent in the two-timescale method as originally applied to the evolution of pure strategies, but is often eliminated when the other methods are used. Moreover, Rogers' paradox is not observed for the mixed learning strategies and learning schedules that we review. We believe that further theoretical work is necessary on learning schedules and biased social learning, based on models that are logically consistent and empirically pertinent. Copyright © 2013 Elsevier Inc. All rights reserved.

  19. Evolution of learning strategies in temporally and spatially variable environments: A review of theory

    Science.gov (United States)

    Aoki, Kenichi; Feldman, Marcus W.

    2013-01-01

    The theoretical literature from 1985 to the present on the evolution of learning strategies in variable environments is reviewed, with the focus on deterministic dynamical models that are amenable to local stability analysis, and on deterministic models yielding evolutionarily stable strategies. Individual learning, unbiased and biased social learning, mixed learning, and learning schedules are considered. A rapidly changing environment or frequent migration in a spatially heterogeneous environment favors individual learning over unbiased social learning. However, results are not so straightforward in the context of learning schedules or when biases in social learning are introduced. The three major methods of modeling temporal environmental change – coevolutionary, two-timescale, and information decay – are compared and shown to sometimes yield contradictory results. The so-called Rogers’ paradox is inherent in the two-timescale method as originally applied to the evolution of pure strategies, but is often eliminated when the other methods are used. Moreover, Rogers’ paradox is not observed for the mixed learning strategies and learning schedules that we review. We believe that further theoretical work is necessary on learning schedules and biased social learning, based on models that are logically consistent and empirically pertinent. PMID:24211681

  20. Spatial and temporal relations in conditioned reinforcement and observing behavior

    OpenAIRE

    Bowe, Craig A.; Dinsmoor, James A.

    1983-01-01

    In Experiment 1, depressing one perch produced stimuli indicating which of two keys, if pecked, could produce food (spatial information) and depressing the other perch produced stimuli indicating whether a variable-interval or an extinction schedule was operating (temporal information). The pigeons increased the time they spent depressing the perch that produced the temporal information but did not increase the time they spent depressing the perch that produced the spatial information. In Exp...

  1. Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data.

    Science.gov (United States)

    Lewis, F L; Vamvoudakis, Kyriakos G

    2011-02-01

    Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the system internal states, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. Linear dynamical systems with deterministic behavior are considered herein, which are systems of great interest in the control system community. In control system theory, these types of methods are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, similar to Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed for the implementation of these learning algorithms or for the OPFB control. Only the order of the system, as well as an upper bound on its "observability index," must be known. The learned OPFB controller is in the form of a polynomial autoregressive moving-average controller that has equivalent performance with the optimal state variable feedback gain.
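
    For readers unfamiliar with the policy-iteration loop referred to above, the sketch below runs textbook policy iteration on a small, fully known MDP. It only illustrates the evaluate-then-improve structure; the paper's actual contribution, a data-driven output-feedback formulation for linear dynamical systems that needs no system model, is not reproduced here, and the toy transition and reward tables are invented.

    ```python
    import numpy as np

    # Textbook policy iteration on a toy 3-state, 2-action MDP (illustrative only).
    n_s, n_a, gamma = 3, 2, 0.9
    # P[a, s, s2] transition probabilities and R[s, a] rewards, all made up.
    P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
                  [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]])
    R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])

    policy = np.zeros(n_s, dtype=int)
    for _ in range(50):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_s)]
        R_pi = R[np.arange(n_s), policy]
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated values.
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy

    print(policy, np.round(V, 2))
    ```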

  2. Advancing of Land Surface Temperature Retrieval Using Extreme Learning Machine and Spatio-Temporal Adaptive Data Fusion Algorithm

    Directory of Open Access Journals (Sweden)

    Yang Bai

    2015-04-01

    Full Text Available As a critical variable for characterizing biophysical processes in the ecological environment, and as a key indicator in the surface energy balance, evapotranspiration and urban heat islands, Land Surface Temperature (LST) retrieved from Thermal Infra-Red (TIR) images at both high temporal and high spatial resolution is urgently needed. However, due to the limitations of existing satellite sensors, there is no earth observation that can obtain TIR at detailed spatial and temporal resolution simultaneously. Thus, several attempts at image fusion, blending TIR data from a high temporal resolution sensor with data from a high spatial resolution sensor, have been studied. This paper presents a novel data fusion method that integrates image fusion and spatio-temporal fusion techniques for deriving LST datasets at 30 m spatial resolution from daily MODIS images and Landsat ETM+ images. The Landsat ETM+ TIR data were first enhanced from 60 m to 30 m resolution using a neural network regression model based on the extreme learning machine (ELM) algorithm. Then, the MODIS LST and the enhanced Landsat ETM+ TIR data were fused by the Spatio-temporal Adaptive Data Fusion Algorithm for Temperature mapping (SADFAT) in order to derive high resolution synthetic data. The synthetic images were evaluated against both test and simulated satellite images. The average difference (AD) and absolute average difference (AAD) are smaller than 1.7 K, while the correlation coefficient (CC) and root-mean-square error (RMSE) are 0.755 and 1.824, respectively, showing that the proposed method enhances the spatial resolution of the predicted LST images while preserving the spectral information.
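
    Extreme learning machine regression, used above to sharpen the thermal band, reduces to a fixed random hidden layer followed by a linear least-squares solve for the output weights. The sketch below shows that recipe on synthetic data; the predictors, targets and network size are placeholders, and neither the image enhancement nor the SADFAT fusion step is reproduced.

    ```python
    import numpy as np

    # Minimal extreme learning machine (ELM) regression on synthetic data.
    rng = np.random.default_rng(0)

    def elm_fit(X, y, n_hidden=50):
        """Random fixed hidden layer + analytic least-squares output weights."""
        W = rng.normal(size=(X.shape[1], n_hidden))    # random input weights (not trained)
        b = rng.normal(size=n_hidden)                  # random biases (not trained)
        H = np.tanh(X @ W + b)                         # hidden-layer activations
        beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights by least squares
        return W, b, beta

    def elm_predict(X, W, b, beta):
        return np.tanh(X @ W + b) @ beta

    # Toy data standing in for coarse-resolution predictors and the target LST field.
    X = rng.normal(size=(200, 4))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
    W, b, beta = elm_fit(X, y)
    print(round(float(np.corrcoef(elm_predict(X, W, b, beta), y)[0, 1]), 3))
    ```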

  3. Sex differences in selecting between food and cocaine reinforcement are mediated by estrogen.

    Science.gov (United States)

    Kerstetter, Kerry A; Ballis, Maya A; Duffin-Lutgen, Stevie; Carr, Amanda E; Behrens, Alexandra M; Kippin, Tod E

    2012-11-01

    Cocaine-dependent women, relative to their male counterparts, report shorter cocaine-free periods and report transiting faster from first use to entering treatment for addiction. Similarly, preclinical studies indicate that female rats, particularly those in the estrus phase of their reproductive cycle, show increased operant responding for cocaine under a wide variety of schedules. Making maladaptive choices is a component of drug dependence, and concurrent reinforcement schedules that examine cocaine choice offers an animal model of the conditions of human drug use; therefore, the examination of sex differences in decision-making may be critical to understanding why women display a more severe profile of cocaine addiction than men. Accordingly, we assessed sex and estrous cycle differences in choice between food (45 mg grain pellets) and intravenous cocaine (0.4 or 1.0 mg/kg per infusion) reinforcement in male, female (freely cycling), and ovariectomized (OVX) females treated with either estrogen benzoate (EB; 5 μg per day) or vehicle. At both cocaine doses, intact female rats choose cocaine over food significantly more than male rats. However, the estrous cycle did not impact the level of cocaine choice in intact females. Nevertheless, OVX females treated with vehicle exhibited a substantially lower cocaine choice compared with those receiving daily EB or to intact females. These results demonstrate that intact females have a greater preference for cocaine over food compared with males. Furthermore, this higher preference is estrogen-dependent, but does not vary across the female reproductive cycle, suggesting that ovarian hormones regulate cocaine choice. The present findings indicate that there is a biological predisposition for females to forgo food reinforcement to obtain cocaine reinforcement, which may substantially contribute to women experiencing a more severe profile of cocaine addiction than men.

  4. Continuous Reinforced Concrete Beams

    DEFF Research Database (Denmark)

    Hoang, Cao Linh; Nielsen, Mogens Peter

    1996-01-01

    This report deals with stress and stiffness estimates of continuous reinforced concrete beams with different stiffnesses for negative and positive moments, e.g. corresponding to different reinforcement areas in top and bottom. Such conditions are often met in practice. The moment distribution

  5. Open Source Tools for Temporally Controlled Rodent Behavior Suitable for Electrophysiology and Optogenetic Manipulations

    Directory of Open Access Journals (Sweden)

    Nicola Solari

    2018-05-01

    Full Text Available Understanding how the brain controls behavior requires observing and manipulating neural activity in awake behaving animals. Neuronal firing is timed at millisecond precision. Therefore, to decipher temporal coding, it is necessary to monitor and control animal behavior at the same level of temporal accuracy. However, it is technically challenging to deliver sensory stimuli and reinforcers as well as to read the behavioral responses they elicit with millisecond precision. Presently available commercial systems often excel in specific aspects of behavior control, but they do not provide a customizable environment allowing flexible experimental design while maintaining high standards for temporal control necessary for interpreting neuronal activity. Moreover, delay measurements of stimulus and reinforcement delivery are largely unavailable. We combined microcontroller-based behavior control with a sound delivery system for playing complex acoustic stimuli, fast solenoid valves for precisely timed reinforcement delivery and a custom-built sound attenuated chamber using high-end industrial insulation materials. Together this setup provides a physical environment to train head-fixed animals, enables calibrated sound stimuli and precisely timed fluid and air puff presentation as reinforcers. We provide latency measurements for stimulus and reinforcement delivery and an algorithm to perform such measurements on other behavior control systems. Combined with electrophysiology and optogenetic manipulations, the millisecond timing accuracy will help interpret temporally precise neural signals and behavioral changes. Additionally, since software and hardware provided here can be readily customized to achieve a large variety of paradigms, these solutions enable an unusually flexible design of rodent behavioral experiments.

  6. Open Source Tools for Temporally Controlled Rodent Behavior Suitable for Electrophysiology and Optogenetic Manipulations.

    Science.gov (United States)

    Solari, Nicola; Sviatkó, Katalin; Laszlovszky, Tamás; Hegedüs, Panna; Hangya, Balázs

    2018-01-01

    Understanding how the brain controls behavior requires observing and manipulating neural activity in awake behaving animals. Neuronal firing is timed at millisecond precision. Therefore, to decipher temporal coding, it is necessary to monitor and control animal behavior at the same level of temporal accuracy. However, it is technically challenging to deliver sensory stimuli and reinforcers as well as to read the behavioral responses they elicit with millisecond precision. Presently available commercial systems often excel in specific aspects of behavior control, but they do not provide a customizable environment allowing flexible experimental design while maintaining high standards for temporal control necessary for interpreting neuronal activity. Moreover, delay measurements of stimulus and reinforcement delivery are largely unavailable. We combined microcontroller-based behavior control with a sound delivery system for playing complex acoustic stimuli, fast solenoid valves for precisely timed reinforcement delivery and a custom-built sound attenuated chamber using high-end industrial insulation materials. Together this setup provides a physical environment to train head-fixed animals, enables calibrated sound stimuli and precisely timed fluid and air puff presentation as reinforcers. We provide latency measurements for stimulus and reinforcement delivery and an algorithm to perform such measurements on other behavior control systems. Combined with electrophysiology and optogenetic manipulations, the millisecond timing accuracy will help interpret temporally precise neural signals and behavioral changes. Additionally, since software and hardware provided here can be readily customized to achieve a large variety of paradigms, these solutions enable an unusually flexible design of rodent behavioral experiments.

  7. Central reinforcing effects of ethanol are blocked by catalase inhibition.

    Science.gov (United States)

    Nizhnikov, Michael E; Molina, Juan C; Spear, Norman E

    2007-11-01

    Recent studies have systematically indicated that newborn rats are highly sensitive to ethanol's positive reinforcing effects. Central administrations of ethanol (25-200mg %) associated with an olfactory conditioned stimulus (CS) promote subsequent conditioned approach to the CS as evaluated through the newborn's response to a surrogate nipple scented with the CS. It has been shown that ethanol's first metabolite, acetaldehyde, exerts significant reinforcing effects in the central nervous system. A significant amount of acetaldehyde is derived from ethanol metabolism via the catalase system. In newborn rats, catalase levels are particularly high in several brain structures. The present study tested the effect of catalase inhibition on central ethanol reinforcement. In the first experiment, pups experienced lemon odor either paired or unpaired with intracisternal (IC) administrations of 100mg% ethanol. Half of the animals corresponding to each learning condition were pretreated with IC administrations of either physiological saline or a catalase inhibitor (sodium-azide). Catalase inhibition completely suppressed ethanol reinforcement in paired groups without affecting responsiveness to the CS during conditioning or responding by unpaired control groups. A second experiment tested whether these effects were specific to ethanol reinforcement or due instead to general impairment in learning and expression capabilities. Central administration of an endogenous kappa opioid receptor agonist (dynorphin A-13) was used as an alternative source of reinforcement. Inhibition of the catalase system had no effect on the reinforcing properties of dynorphin. The present results support the hypothesis that ethanol metabolism regulated by the catalase system plays a critical role in determination of ethanol reinforcement in newborn rats.

  8. The Memory Trace Supporting Lose-Shift Responding Decays Rapidly after Reward Omission and Is Distinct from Other Learning Mechanisms in Rats.

    Science.gov (United States)

    Gruber, Aaron J; Thapa, Rajat

    2016-01-01

    The propensity of animals to shift choices immediately after unexpectedly poor reinforcement outcomes is a pervasive strategy across species and tasks. We report here that the memory supporting such lose-shift responding in rats rapidly decays during the intertrial interval and persists throughout training and testing on a binary choice task, despite being a suboptimal strategy. Lose-shift responding is not positively correlated with the prevalence and temporal dependence of win-stay responding, and it is inconsistent with predictions of reinforcement learning on the task. These data provide further evidence that win-stay and lose-shift are mediated by dissociated neural mechanisms and indicate that lose-shift responding presents a potential confound for the study of choice in the many operant choice tasks with short intertrial intervals. We propose that this immediate lose-shift responding is an intrinsic feature of the brain's choice mechanisms that is engaged as a choice reflex and works in parallel with reinforcement learning and other control mechanisms to guide action selection.
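
    To make the contrast drawn above concrete, the sketch below simulates a pure lose-shift heuristic and a simple incremental (Rescorla-Wagner style) learner on a two-lever bandit. The reward probabilities, learning rate and trial count are arbitrary illustration choices, not parameters fitted to the rat data.

    ```python
    import numpy as np

    # Toy comparison of a lose-shift heuristic and incremental value learning
    # on a two-lever bandit (lever 0 is the better option).
    rng = np.random.default_rng(2)
    p_reward = [0.7, 0.3]
    n_trials, alpha = 500, 0.1

    # Agent 1: lose-shift (repeat after reward, switch after omission).
    choice, better_ls = 0, 0
    for _ in range(n_trials):
        rewarded = rng.random() < p_reward[choice]
        better_ls += (choice == 0)
        choice = choice if rewarded else 1 - choice

    # Agent 2: incremental learner with a Rescorla-Wagner style value update.
    Q, better_rl = np.zeros(2), 0
    for _ in range(n_trials):
        choice = int(np.argmax(Q)) if Q[0] != Q[1] else int(rng.integers(2))
        reward = float(rng.random() < p_reward[choice])
        Q[choice] += alpha * (reward - Q[choice])
        better_rl += (choice == 0)

    print(f"better lever chosen: lose-shift {better_ls / n_trials:.2f}, RL {better_rl / n_trials:.2f}")
    ```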

  9. Dynamics of layered reinforced concrete beam on visco-elastic foundation with different resistances of concrete and reinforcement to tension and compression

    Science.gov (United States)

    Nemirovsky, Y. V.; Tikhonov, S. V.

    2018-03-01

    Originally, the fundamentals of the theory of limit equilibrium and dynamic deformation of metal and reinforced concrete building structures were created by A. A. Gvozdev [1] and developed by his followers [4, 5, 6, 7, 11, 12]. The model of an ideal rigid-plastic material, which forms the basis of the calculation, has made it possible in many cases to determine the ultimate load-bearing capacity and upper (kinematically admissible) or lower (statically admissible) estimates for a wide class of different structures by quite simple methods. At the same time, when applied to concrete structures, the most important property of concrete, namely that it resists tension and compression very differently, was not taken into account [10]. This circumstance was considered in [3] for reinforced concrete beams under quasistatic loading. In construction practice, deformation is often accompanied by resistance of the surrounding medium [8, 9]. In [2], the dynamics of multi-layered concrete beams on a visco-elastic foundation under explosive-type loading is considered. In this work we consider the case, often encountered in practical applications, in which the loading varies slowly in time.

  10. South Oregon Coast Reinforcement.

    Energy Technology Data Exchange (ETDEWEB)

    United States. Bonneville Power Administration.

    1998-05-01

    The Bonneville Power Administration is proposing to build a transmission line to reinforce electrical service to the southern coast of Oregon. This FYI outlines the proposal, tells how one can learn more, and how one can share ideas and opinions. The project will reinforce Oregon's south coast area and provide the necessary transmission for Nucor Corporation to build a new steel mill in the Coos Bay/North Bend area. The proposed plant, which would use mostly recycled scrap metal, would produce rolled steel products. The plant would require a large amount of electrical power to run the furnace used in its steel-making process. In addition to the potential steel mill, electrical loads in the south Oregon coast area are expected to continue to grow.

  11. Enhanced learning of proportional math through music training and spatial-temporal training.

    Science.gov (United States)

    Graziano, A B; Peterson, M; Shaw, G L

    1999-03-01

    It was predicted, based on a mathematical model of the cortex, that early music training would enhance spatial-temporal reasoning. We have demonstrated that preschool children given six months of piano keyboard lessons improved dramatically on spatial-temporal reasoning while children in appropriate control groups did not improve. It was then predicted that the enhanced spatial-temporal reasoning from piano keyboard training could lead to enhanced learning of specific math concepts, in particular proportional math, which is notoriously difficult to teach using the usual language-analytic methods. We report here the development of Spatial-Temporal Math Video Game software designed to teach fractions and proportional math, and its strikingly successful use in a study involving 237 second-grade children (age range six years eight months-eight years five months). Furthermore, as predicted, children given piano keyboard training along with the Math Video Game training scored significantly higher on proportional math and fractions than children given a control training along with the Math Video Game. These results were readily measured using the companion Math Video Game Evaluation Program. The training time necessary for children on the Math Video Game is very short, and they rapidly reach a high level of performance. This suggests that, as predicted, we are tapping into fundamental cortical processes of spatial-temporal reasoning. This spatial-temporal approach is easily generalized to teach other math and science concepts in a complementary manner to traditional language-analytic methods, and at a younger age. The neural mechanisms involved in thinking through fractions and proportional math during training with the Math Video Game might be investigated in EEG coherence studies along with priming by specific music.

  12. Optimisation of cognitive performance in rodent operant (touchscreen) testing: Evaluation and effects of reinforcer strength.

    Science.gov (United States)

    Phillips, Benjamin U; Heath, Christopher J; Ossowska, Zofia; Bussey, Timothy J; Saksida, Lisa M

    2017-09-01

    Operant testing is a widely used and highly effective method of studying cognition in rodents. Performance on such tasks is sensitive to reinforcer strength. It is therefore advantageous to select effective reinforcers to minimize training times and maximize experimental throughput. To quantitatively investigate the control of behavior by different reinforcers, performance of mice was tested with either strawberry milkshake or a known powerful reinforcer, super saccharin (1.5% or 2% (w/v) saccharin/1.5% (w/v) glucose/water mixture). Mice were tested on fixed (FR)- and progressive-ratio (PR) schedules in the touchscreen-operant testing system. Under an FR schedule, both the rate of responding and number of trials completed were higher in animals responding for strawberry milkshake versus super saccharin. Under a PR schedule, mice were willing to emit similar numbers of responses for strawberry milkshake and super saccharin; however, analysis of the rate of responding revealed a significantly higher rate of responding by animals reinforced with milkshake versus super saccharin. To determine the impact of reinforcer strength on cognitive performance, strawberry milkshake and super saccharin-reinforced animals were compared on a touchscreen visual discrimination task. Animals reinforced by strawberry milkshake were significantly faster to acquire the discrimination than animals reinforced by super saccharin. Taken together, these results suggest that strawberry milkshake is superior to super saccharin for operant behavioral testing and further confirms that the application of response rate analysis to multiple ratio tasks is a highly sensitive method for the detection of behavioral differences relevant to learning and motivation.

  13. Effects of neonatal inferior prefrontal and medial temporal lesions on learning the rule for delayed nonmatching-to-sample.

    Science.gov (United States)

    Málková, L; Bachevalier, J; Webster, M; Mishkin, M

    2000-01-01

    The ability of rhesus monkeys to master the rule for delayed nonmatching-to-sample (DNMS) has a protracted ontogenetic development, reaching adult levels of proficiency around 4 to 5 years of age (Bachevalier, 1990). To test the possibility that this slow development could be due, at least in part, to immaturity of the prefrontal component of a temporo-prefrontal circuit important for DNMS rule learning (Kowalska, Bachevalier, & Mishkin, 1991; Weinstein, Saunders, & Mishkin, 1988), monkeys with neonatal lesions of the inferior prefrontal convexity were compared on DNMS with both normal controls and animals given neonatal lesions of the medial temporal lobe. Consistent with our previous results (Bachevalier & Mishkin, 1994; Málková, Mishkin, & Bachevalier, 1995), the neonatal medial temporal lesions led to marked impairment in rule learning (as well as in recognition memory with long delays and list lengths) at both 3 months and 2 years of age. By contrast, the neonatal inferior convexity lesions yielded no impairment in rule-learning at 3 months and only a mild impairment at 2 years, a finding that also contrasts sharply with the marked effects of the same lesion made in adulthood. This pattern of sparing closely resembles the one found earlier after neonatal lesions to the cortical visual area TE (Bachevalier & Mishkin, 1994; Málková et al., 1995). The functional sparing at 3 months probably reflects the fact that the temporo-prefrontal circuit is nonfunctional at this early age, resulting in a total dependency on medial temporal contributions to rule learning. With further development, however, this circuit begins to provide a supplementary route for learning.

  14. Using Spatial Reinforcement Learning to Build Forest Wildfire Dynamics Models From Satellite Images

    Directory of Open Access Journals (Sweden)

    Sriram Ganapathi Subramanian

    2018-04-01

    Full Text Available Machine learning algorithms have increased tremendously in power in recent years but have yet to be fully utilized in many ecology and sustainable resource management domains such as wildlife reserve design, forest fire management, and invasive species spread. One thing these domains have in common is that they contain dynamics that can be characterized as a spatially spreading process (SSP), which requires many parameters to be set precisely to model the dynamics, spread rates, and directional biases of the elements which are spreading. We present related work in artificial intelligence and machine learning for SSP sustainability domains including forest wildfire prediction. We then introduce a novel approach for learning in SSP domains using reinforcement learning (RL), where fire is the agent at any cell in the landscape and the set of actions the fire can take from a location at any point in time includes spreading north, south, east, or west or not spreading. This approach inverts the usual RL setup since the dynamics of the corresponding Markov Decision Process (MDP) is a known function for immediate wildfire spread. Meanwhile, we learn an agent policy for a predictive model of the dynamics of a complex spatial process. Rewards are provided for correctly classifying which cells are on fire or not compared with satellite and other related data. We examine the behavior of five RL algorithms on this problem: value iteration, policy iteration, Q-learning, Monte Carlo Tree Search, and Asynchronous Advantage Actor-Critic (A3C). We compare to a Gaussian process-based supervised learning approach and also discuss the relation of our approach to manually constructed, state-of-the-art methods from forest wildfire modeling. We validate our approach with satellite image data of two massive wildfire events in Northern Alberta, Canada; the Fort McMurray fire of 2016 and the Richardson fire of 2011. The results show that we can learn predictive, agent
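
    A heavily simplified sketch of the "fire as agent" framing is given below: tabular Q-learning on a toy grid in which the agent at a burning cell chooses a spread direction and is rewarded for moving into cells of a hypothetical observed burn map. The grid, burn map, rewards and parameters are invented placeholders; the study works from satellite imagery with richer state features and compares several RL algorithms, none of which are reproduced here.

    ```python
    import numpy as np

    # Toy "fire as agent" Q-learning on a grid with a made-up observed burn map.
    rng = np.random.default_rng(3)
    H, W = 10, 10
    observed_burn = np.zeros((H, W), dtype=bool)
    observed_burn[4:8, 4:8] = True                        # hypothetical burn scar

    actions = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]  # N, S, W, E, no spread
    Q = np.zeros((H, W, len(actions)))
    alpha, gamma, eps = 0.2, 0.9, 0.2

    for episode in range(300):
        r_, c_ = 5, 5                                     # ignition cell inside the scar
        for step in range(30):
            a = rng.integers(len(actions)) if rng.random() < eps else int(Q[r_, c_].argmax())
            dr, dc = actions[a]
            nr, nc = min(H - 1, max(0, r_ + dr)), min(W - 1, max(0, c_ + dc))
            # Reward spreading into cells that "actually burned", penalize otherwise.
            reward = 1.0 if observed_burn[nr, nc] else -1.0
            Q[r_, c_, a] += alpha * (reward + gamma * Q[nr, nc].max() - Q[r_, c_, a])
            r_, c_ = nr, nc

    # The learned values act as a crude per-cell spread-direction model.
    print(np.round(Q.max(axis=2)[3:9, 3:9], 1))
    ```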

  15. Conditioned reinforcement can be mediated by either outcome-specific or general affective representations

    Directory of Open Access Journals (Sweden)

    Kathryn A Burke

    2007-11-01

    Full Text Available Conditioned reinforcers are Pavlovian cues that support the acquisition and maintenance of new instrumental responses. Responding on the basis of conditioned rather than primary reinforcers is a pervasive part of modern life, yet we have a remarkably limited understanding of what underlying associative information is triggered by these cues to guide responding. Specifically, it is not certain whether conditioned reinforcers are effective because they evoke representations of specific outcomes or because they trigger general affective states that are independent of any specific outcome. This question has important implications for how different brain circuits might be involved in conditioned reinforcement. Here, we use specialized Pavlovian training procedures, reinforcer devaluation and transreinforcer blocking, to create cues that were biased to preferentially evoke either devaluation-insensitive, general affect representations or devaluation-sensitive, outcome-specific representations. Subsequently, these cues, along with normally conditioned control cues, were presented contingent on lever pressing. We found that intact rats learned to lever press for either the outcome or the affect cues to the same extent as for a normally conditioned cue. These results demonstrate that conditioned reinforcers can guide responding through either type of associative information. Interestingly, conditioned reinforcement was abolished in rats with basolateral amygdala lesions. Consistent with the extant literature, this result suggests a general role for basolateral amygdala in conditioned reinforcement. The implications of these data, combined with recent reports from our laboratory of a more specialized role of orbitofrontal cortex in conditioned reinforcement, will be discussed.

  16. Effect of hybrid fiber reinforcement on the cracking process in fiber reinforced cementitious composites

    DEFF Research Database (Denmark)

    Pereira, Eduardo B.; Fischer, Gregor; Barros, Joaquim A.O.

    2012-01-01

    The simultaneous use of different types of fibers as reinforcement in cementitious matrix composites is typically motivated by the underlying principle of a multi-scale nature of the cracking processes in fiber reinforced cementitious composites. It has been hypothesized that while undergoing...... tensile deformations in the composite, the fibers with different geometrical and mechanical properties restrain the propagation and further development of cracking at different scales from the micro- to the macro-scale. The optimized design of the fiber reinforcing systems requires the objective...... materials is carried out by assessing directly their tensile stress-crack opening behavior. The efficiency of hybrid fiber reinforcements and the multi-scale nature of cracking processes are discussed based on the experimental results obtained, as well as the micro-mechanisms underlying the contribution...

  17. Spatiotemporal neural characterization of prediction error valence and surprise during reward learning in humans.

    Science.gov (United States)

    Fouragnan, Elsa; Queirazza, Filippo; Retzler, Chris; Mullinger, Karen J; Philiastides, Marios G

    2017-07-06

    Reward learning depends on accurate reward associations with potential choices. These associations can be attained with reinforcement learning mechanisms using a reward prediction error (RPE) signal (the difference between actual and expected rewards) for updating future reward expectations. Despite an extensive body of literature on the influence of RPE on learning, little has been done to investigate the potentially separate contributions of RPE valence (positive or negative) and surprise (absolute degree of deviation from expectations). Here, we coupled single-trial electroencephalography with simultaneously acquired fMRI, during a probabilistic reversal-learning task, to offer evidence of temporally overlapping but largely distinct spatial representations of RPE valence and surprise. Electrophysiological variability in RPE valence correlated with activity in regions of the human reward network promoting approach or avoidance learning. Electrophysiological variability in RPE surprise correlated primarily with activity in regions of the human attentional network controlling the speed of learning. Crucially, despite the largely separate spatial extent of these representations, our EEG-informed fMRI approach uniquely revealed a linear superposition of the two RPE components in a smaller network encompassing visuo-mnemonic and reward areas. Activity in this network was further predictive of stimulus value updating, indicating a comparable contribution of both signals to reward learning.
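
    As a minimal illustration of the two RPE components analyzed above, the sketch below splits a Rescorla-Wagner style prediction error into its sign (valence) and magnitude (surprise). The reward sequence and learning rate are arbitrary, and the study's EEG-informed fMRI analysis is not reproduced.

    ```python
    import numpy as np

    # Split a reward prediction error into valence (sign) and surprise (magnitude).
    def rpe_components(expected, received, alpha=0.2):
        delta = received - expected             # reward prediction error
        valence = np.sign(delta)                # positive vs negative outcome
        surprise = abs(delta)                   # unsigned deviation from expectation
        updated = expected + alpha * delta      # Rescorla-Wagner style value update
        return valence, surprise, updated

    expectation = 0.5
    for outcome in [1.0, 0.0, 1.0, 1.0]:        # hypothetical reward sequence
        v, s, expectation = rpe_components(expectation, outcome)
        print(f"valence={v:+.0f}  surprise={s:.2f}  new expectation={expectation:.2f}")
    ```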

  18. 3D City Models with Different Temporal Characteristica

    DEFF Research Database (Denmark)

    Bodum, Lars

    2005-01-01

    traditional static city models and those models that are built for realtime applications. The difference between the city models applies both to the spatial modelling and also when using the phenomenon time in the models. If the city models are used in visualizations without any variation in time or when......-built dynamic or a model suitable for visualization in realtime, it is required that modelling is done with level-of-detail and simplification of both the aesthetics and the geometry. If a temporal characteristic is combined with a visual characteristic, the situation can easily be seen as a t/v matrix where t...... is the temporal characteristic or representation and v is the visual characteristic or representation....

  19. Individual differences in sensitivity to reward and punishment and neural activity during reward and avoidance learning.

    Science.gov (United States)

    Kim, Sang Hee; Yoon, HeungSik; Kim, Hackjin; Hamann, Stephan

    2015-09-01

    In this functional neuroimaging study, we investigated neural activations during the process of learning to gain monetary rewards and to avoid monetary loss, and how these activations are modulated by individual differences in reward and punishment sensitivity. Healthy young volunteers performed a reinforcement learning task where they chose one of two fractal stimuli associated with monetary gain (reward trials) or avoidance of monetary loss (avoidance trials). Trait sensitivity to reward and punishment was assessed using the behavioral inhibition/activation scales (BIS/BAS). Functional neuroimaging results showed activation of the striatum during the anticipation and reception periods of reward trials. During avoidance trials, activation of the dorsal striatum and prefrontal regions was found. As expected, individual differences in reward sensitivity were positively associated with activation in the left and right ventral striatum during reward reception. Individual differences in sensitivity to punishment were negatively associated with activation in the left dorsal striatum during avoidance anticipation and also with activation in the right lateral orbitofrontal cortex during receiving monetary loss. These results suggest that learning to attain reward and learning to avoid loss are dependent on separable sets of neural regions whose activity is modulated by trait sensitivity to reward or punishment. © The Author (2015). Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  20. Temporal stability of novelty exploration in mice exposed to different open field tests.

    Science.gov (United States)

    Kalueff, Allan V; Keisala, Tiina; Minasyan, Anna; Kuuslahti, Marianne; Tuohimaa, Pentti

    2006-03-01

    We investigated behavioural activity and temporal distribution (patterning) of mouse exploration in different open field (OF) arenas. Mice of 129S1 (S1) strain were subjected in parallel to three different OF arenas (Experiment 1), two different OF arenas in two trials (Experiment 2) or two trials of the same OF test (Experiment 3). Overall, mice demonstrated a high degree of similarity in the temporal profile of novelty-induced horizontal and vertical exploration (regardless of the size, colour and shape of the OF), which remained stable in subsequent OF exposures. In Experiments 4 and 5, we tested F1 hybrid mice (BALB/c-S1; NMRI-S1), and Vitamin D receptor knockout mice (generated on S1 genetic background), again showing strikingly similar temporal patterns of their OF exploration, despite marked behavioural strain differences in anxiety and activity. These results suggest that mice are characterised by stability of temporal organization of their exploration in different OF novelty situations.

  1. Verbal learning and memory outcome in selective amygdalohippocampectomy versus temporal lobe resection in patients with hippocampal sclerosis.

    Science.gov (United States)

    Foged, Mette Thrane; Vinter, Kirsten; Stauning, Louise; Kjær, Troels W; Ozenne, Brice; Beniczky, Sándor; Paulson, Olaf B; Madsen, Flemming Find; Pinborg, Lars H

    2018-02-01

    With the advent of new very selective techniques like thermal laser ablation to treat drug-resistant focal epilepsy, the controversy of resection size in relation to seizure outcome versus cognitive deficits has gained new relevance. The purpose of this study was to test the influence of the selective amygdalohippocampectomy (SAH) versus nonselective temporal lobe resection (TLR) on seizure outcome and cognition in patients with mesial temporal lobe epilepsy (MTLE) and histopathologically verified hippocampal sclerosis (HS). We identified 108 adults (>16 years) with HS, operated between 1995 and 2009 in Denmark. Exclusion criteria are the following: Intelligence below normal range, right hemisphere dominance, other native languages than Danish, dual pathology, and missing follow-up data. Thus, 56 patients were analyzed. The patients were allocated to SAH (n=22) or TLR (n=34) based on intraoperative electrocorticography. Verbal learning and verbal memory were tested pre- and postsurgery. Seizure outcome did not differ between patients operated using the SAH versus the TLR at 1 year (p=0.951) nor at 7 years (p=0.177). Verbal learning was more affected in patients resected in the left hemisphere than in the right (p=0.002). In patients with left-sided TLR, a worsening in verbal memory performance was found (p=0.011). Altogether, 73% were seizure-free for 1 year and 64% for 7 years after surgery. In patients with drug-resistant focal MTLE, HS and no magnetic resonance imaging (MRI) signs of dual pathology, selective amygdalohippocampectomy results in sustained seizure freedom and better memory function compared with patients operated with nonselective temporal lobe resection. Copyright © 2017 Elsevier Inc. All rights reserved.

  2. Influence of transverse reinforcement on perforation resistance of reinforced concrete slabs under hard missile impact

    International Nuclear Information System (INIS)

    Orbovic, Nebojsa; Sagals, Genadijs; Blahoianu, Andrei

    2015-01-01

    This paper describes the work conducted by the Canadian Nuclear Safety Commission (CNSC) related to the influence of transverse reinforcement on perforation capacity of reinforced concrete (RC) slabs under “hard” missile impact (impact with negligible missile deformations). The paper presents the results of three tests on reinforced concrete slabs conducted at VTT Technical Research Centre (Finland), along with the numerical simulations as well as a discussion of the current code provisions related to impactive loading. Transverse reinforcement is widely used for improving the shear and punching strength of concrete structures. However, the effect of this reinforcement on the perforation resistance under localized missile impact is still unclear. The goal of this paper is to fill the gap in the current literature related to this topic. Based on similar tests designed by the authors with missile velocity below perforation velocity, it was expected that transverse reinforcement would improve the perforation resistance. Three slabs were tested under almost identical conditions with the only difference being the transverse reinforcement. One slab was designed without transverse reinforcement, the second one with the transverse reinforcement in form of conventional stirrups with hooks and the third one with the transverse reinforcement in form of T-headed bars. Although the transverse reinforcement reduced the overall damage of the slabs (the rear face scabbing), the conclusion from the tests is that the transverse reinforcement does not have important influence on perforation capacity of concrete slabs under rigid missile impact. The slab with T-headed bars presented a slight improvement compared to the baseline specimen without transverse reinforcement. The slab with conventional stirrups presented slightly lower perforation capacity (higher residual missile velocity) than the slab without transverse reinforcement. In conclusion, the performed tests show slightly

  3. Learning About Time Within the Spinal Cord II: Evidence that Temporal Regularity is Encoded by a Spinal Oscillator

    Directory of Open Access Journals (Sweden)

    Kuan Hsien Lee

    2016-02-01

    Full Text Available How a stimulus impacts spinal cord function depends upon temporal relations. When intermittent noxious stimulation (shock) is applied and the interval between shock pulses is varied (unpredictable), it induces a lasting alteration that inhibits adaptive learning. If the same stimulus is applied in a temporally regular (predictable) manner, the capacity to learn is preserved and a protective/restorative effect is engaged that counters the adverse effect of variable stimulation. Sensitivity to temporal relations implies a capacity to encode time. This study explores how spinal neurons discriminate variable and fixed spaced stimulation. Communication with the brain was blocked by means of a spinal transection and adaptive capacity was tested using an instrumental learning task. In this task, subjects must learn to maintain a hind limb in a flexed position to minimize shock exposure. To evaluate the possibility that a distinct class of afferent fibers provide a sensory cue for regularity, we manipulated the temporal relation between shocks given to two dermatomes (leg and tail). Evidence for timing emerged when the stimuli were applied in a coherent manner across dermatomes, implying that a central (spinal) process detects regularity. Next, we show that fixed spaced stimulation has a restorative effect when half the physical stimuli are randomly omitted, as long as the stimuli remain in phase, suggesting that stimulus regularity is encoded by an internal oscillator. Research suggests that the oscillator that drives the tempo of stepping depends upon neurons within the rostral lumbar (L1-L2) region. Disrupting communication with the L1-L2 tissue by means of a L3 transection eliminated the restorative effect of fixed spaced stimulation. Implications of the results for step training and rehabilitation after injury are discussed.

  4. Investigation of a Reinforcement-Based Toilet Training Procedure for Children with Autism.

    Science.gov (United States)

    Cicero, Frank R.; Pfadt, Al

    2002-01-01

    This study evaluated the effectiveness of a reinforcement-based toilet training intervention with three children with autism. Procedures included positive reinforcement, graduated guidance, scheduled practice trials, and forward prompting. All three children reduced urination accidents to zero and learned to request bathroom use spontaneously…

  5. Neuronal representations of stimulus associations develop in the temporal lobe during learning.

    Science.gov (United States)

    Messinger, A; Squire, L R; Zola, S M; Albright, T D

    2001-10-09

    Visual stimuli that are frequently seen together become associated in long-term memory, such that the sight of one stimulus readily brings to mind the thought or image of the other. It has been hypothesized that acquisition of such long-term associative memories proceeds via the strengthening of connections between neurons representing the associated stimuli, such that a neuron initially responding only to one stimulus of an associated pair eventually comes to respond to both. Consistent with this hypothesis, studies have demonstrated that individual neurons in the primate inferior temporal cortex tend to exhibit similar responses to pairs of visual stimuli that have become behaviorally associated. In the present study, we investigated the role of these areas in the formation of conditional visual associations by monitoring the responses of individual neurons during the learning of new stimulus pairs. We found that many neurons in both area TE and perirhinal cortex came to elicit more similar neuronal responses to paired stimuli as learning proceeded. Moreover, these neuronal response changes were learning-dependent and proceeded with an average time course that paralleled learning. This experience-dependent plasticity of sensory representations in the cerebral cortex may underlie the learning of associations between objects.

  6. Behavioral evidence for differences in social and non-social category learning

    Directory of Open Access Journals (Sweden)

    Lucile eGamond

    2012-08-01

    Full Text Available When meeting someone for the very first time one spontaneously categorizes the seen person on the basis of his/her appearance. Categorization is based on the association between some physical features and category labels that can be social (character trait…) or non-social (tall, thin). Surprisingly little is known about how such associations are formed, particularly in the social domain. Here, we aimed at testing whether social and non-social category learning may be dissociated. We presented subjects with a large number of faces that had to be rated according to social or non-social labels, and induced an association between a facial feature (inter-eye distance) and the category labels using two different procedures. In a first experiment, we used a feedback procedure to reinforce the association; behavioral measures revealed an association between the physical feature manipulated and abstract non-social categories, while no evidence for an association with social labels could be found. In a second experiment, we used passive exposure to the association between physical features and labels; we obtained behavioral evidence for learning of both social and non-social categories. These results support the view of the specificity of social category learning; they suggest that social categories are best acquired through unsupervised procedures that can be considered as a simplified proxy for group transmission.

  7. The role of multisensor data fusion in neuromuscular control of a sagittal arm with a pair of muscles using actor-critic reinforcement learning method.

    Science.gov (United States)

    Golkhou, V; Parnianpour, M; Lucas, C

    2004-01-01

    In this study, we consider the role of multisensor data fusion in neuromuscular control using an actor-critic reinforcement learning method. The model we use is a single link system actuated by a pair of muscles that are excited with alpha and gamma signals. Various physiological sensor information such as proprioception, spindle sensors, and Golgi tendon organs have been integrated to achieve an oscillatory movement with variable amplitude and frequency, while achieving a stable movement with minimum metabolic cost and coactivation. The system is highly nonlinear in all its physical and physiological attributes. Transmission delays are included in the afferent and efferent neural paths to account for a more accurate representation of the reflex loops. This paper proposes a reinforcement learning method with an Actor-Critic architecture in place of the middle and low levels of the central nervous system (CNS). The Actor in this structure is a two layer feedforward neural network and the Critic is a model of the cerebellum. The Critic is trained by the State-Action-Reward-State-Action (SARSA) method. The Critic will train the Actor by supervisory learning based on previous experiences. The reinforcement signal in SARSA is evaluated based on available alternatives concerning the concept of multisensor data fusion. The effectiveness and the biological plausibility of the present model are demonstrated by several simulations. The system showed excellent tracking capability when we integrated the available sensor information. Addition of a penalty for activation of muscles resulted in much lower muscle coactivation while keeping the movement stable.
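
    A minimal, self-contained sketch of the actor-critic idea referenced above, with a SARSA-style temporal-difference error produced by the critic and used to adjust the actor. The toy environment, tabular representations, and learning-rate values are assumptions for illustration only; the paper's actor is a feedforward neural network and its critic is a cerebellum model acting on a musculoskeletal simulation.

```python
import numpy as np

# Minimal tabular actor-critic sketch with a SARSA-style critic update.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))          # critic: action-value estimates
prefs = np.zeros((n_states, n_actions))      # actor: action preferences
alpha_c, alpha_a, gamma = 0.1, 0.05, 0.95    # assumed learning rates / discount

def policy(s):
    # Softmax over the actor's preferences in state s.
    p = np.exp(prefs[s] - prefs[s].max())
    p /= p.sum()
    return rng.choice(n_actions, p=p)

def step(s, a):
    # Toy environment stub: random next state, reward 1 for action 0 in state 0.
    return rng.integers(n_states), float(s == 0 and a == 0)

s = 0
a = policy(s)
for _ in range(1000):
    s_next, r = step(s, a)
    a_next = policy(s_next)
    # SARSA temporal-difference error evaluated by the critic
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha_c * delta
    # The actor is nudged by the critic's evaluation of its own choice
    prefs[s, a] += alpha_a * delta
    s, a = s_next, a_next

print(Q[0], prefs[0])
```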

  8. Properties of discontinuous S2-glass fiber-particulate-reinforced resin composites with two different fiber length distributions.

    Science.gov (United States)

    Huang, Qiting; Garoushi, Sufyan; Lin, Zhengmei; He, Jingwei; Qin, Wei; Liu, Fang; Vallittu, Pekka Kalevi; Lassila, Lippo Veli Juhana

    2017-10-01

    To investigate the reinforcing efficiency and light curing properties of discontinuous S2-glass fiber-particulate reinforced resin composite and to examine length distribution of discontinuous S2-glass fibers after a mixing process into resin composite. Experimental S2-glass fiber-particulate reinforced resin composites were prepared by mixing 10wt% of discontinuous S2-glass fibers, which had been manually cut into two different lengths (1.5 and 3.0 mm), with various weight ratios of dimethacrylate based resin matrix and silaned BaAlSiO2 filler particulates. The resin composite made with 25wt% of UDMA/SR833s resin system and 75wt% of silaned BaAlSiO2 filler particulates was used as control composite which had similar composition as the commonly used resin composites. Flexural strength (FS), flexural modulus (FM) and work of fracture (WOF) were measured. Fractured specimens were observed by scanning electron microscopy. Double bond conversion (DC) and fiber length distribution were also studied. Reinforcement of resin composites with discontinuous S2-glass fibers can significantly increase the FS, FM and WOF of resin composites over the control. The fibers from the mixed resin composites showed great variation in final fiber length. The mean aspect ratio of experimental composites containing 62.5wt% of particulate fillers and 10wt% of 1.5 or 3.0 mm cutting S2-glass fibers was 70 and 132, respectively. No difference was found in DC between resin composites containing S2-glass fibers with two different cutting lengths. Discontinuous S2-glass fibers can effectively reinforce the particulate-filled resin composite and thus may have potential for manufacturing resin composites for high-stress bearing applications. Copyright © 2017. Published by Elsevier Ltd.

  9. Pupil dilation indicates the coding of past prediction errors: Evidence for attentional learning theory.

    Science.gov (United States)

    Koenig, Stephan; Uengoer, Metin; Lachnit, Harald

    2018-04-01

    The attentional learning theory of Pearce and Hall () predicts more attention to uncertain cues that have caused a high prediction error in the past. We examined how the cue-elicited pupil dilation during associative learning was linked to such error-driven attentional processes. In three experiments, participants were trained to acquire associations between different cues and their appetitive (Experiment 1), motor (Experiment 2), or aversive (Experiment 3) outcomes. All experiments were designed to examine differences in the processing of continuously reinforced cues (consistently followed by the outcome) versus partially reinforced, uncertain cues (randomly followed by the outcome). We measured the pupil dilation elicited by the cues in anticipation of the outcome and analyzed how this conditioned pupil response changed over the course of learning. In all experiments, changes in pupil size complied with the same basic pattern: During early learning, consistently reinforced cues elicited greater pupil dilation than uncertain, randomly reinforced cues, but this effect gradually reversed to yield a greater pupil dilation for uncertain cues toward the end of learning. The pattern of data accords with the changes in prediction error and error-driven attention formalized by the Pearce-Hall theory. © 2017 The Authors. Psychophysiology published by Wiley Periodicals, Inc. on behalf of Society for Psychophysiological Research.
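
    The Pearce-Hall principle the authors invoke can be stated compactly: a cue's associability (attention) tracks the absolute prediction error it generated on recent trials, so uncertain, partially reinforced cues keep attracting attention while well-predicted cues lose it. The toy sketch below uses one common hybrid form of this rule; the parameter values and the exact update variant are assumptions, not those fitted in the study.

```python
# Toy sketch of the Pearce-Hall idea: cue associability (attention) tracks
# the absolute prediction error from recent trials. Parameter values and the
# exact update-rule variant are assumptions for illustration only.
def pearce_hall(outcomes, S=1.0, gamma=0.3, lr=0.5, alpha0=1.0):
    V, alpha = 0.0, alpha0
    history = []
    for lam in outcomes:                    # lam = 1 if reinforced, 0 if not
        V += lr * alpha * S * (lam - V)     # associative strength update
        alpha = gamma * abs(lam - V) + (1 - gamma) * alpha  # attention update
        history.append((V, alpha))
    return history

# Consistently reinforced cue: prediction error, and hence alpha, decays.
print(pearce_hall([1] * 10)[-1])
# Partially (randomly) reinforced cue: alpha stays elevated.
print(pearce_hall([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])[-1])
```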

  10. Assist-as-needed robotic trainer based on reinforcement learning and its application to dart-throwing.

    Science.gov (United States)

    Obayashi, Chihiro; Tamei, Tomoya; Shibata, Tomohiro

    2014-05-01

    This paper proposes a novel robotic trainer for motor skill learning. It is user-adaptive, inspired by the assist-as-needed principle well known in the field of physical therapy. Most previous studies in the field of the robotic assistance of motor skill learning have used predetermined desired trajectories, and it has not been examined intensively whether these trajectories were optimal for each user. Furthermore, the guidance hypothesis states that humans tend to rely too much on external assistive feedback, resulting in interference with the internal feedback necessary for motor skill learning. A few studies have proposed a system that adjusts its assistive strength according to the user's performance in order to prevent the user from relying too much on the robotic assistance. There are, however, problems in these studies, in that a physical model of the user's motor system is required, which is inherently difficult to construct. In this paper, we propose a framework for a robotic trainer that is user-adaptive and that neither requires a specific desired trajectory nor a physical model of the user's motor system, and we achieve this using model-free reinforcement learning. We chose dart-throwing as an example motor-learning task as it is one of the simplest throwing tasks, and its performance can easily and quantitatively be measured. Training experiments with novices, aiming at maximizing the score with the darts and minimizing the physical robotic assistance, demonstrate the feasibility and plausibility of the proposed framework. Copyright © 2014 Elsevier Ltd. All rights reserved.

  11. Full-fledged temporal processing: bridging the gap between deep linguistic processing and temporal extraction

    Directory of Open Access Journals (Sweden)

    Francisco Costa

    2013-07-01

    Full Text Available The full-fledged processing of temporal information presents specific challenges. These difficulties largely stem from the fact that the temporal meaning conveyed by grammatical means interacts with many extra-linguistic factors (world knowledge, causality, calendar systems, reasoning). This article proposes a novel approach to this problem, based on a hybrid strategy that explores the complementarity of the symbolic and probabilistic methods. A specialized temporal extraction system is combined with a deep linguistic processing grammar. The temporal extraction system extracts eventualities, times and dates mentioned in text, and also temporal relations between them, in line with the tasks of the recent TempEval challenges; and uses machine learning techniques to draw from different sources of information (grammatical and extra-grammatical) even if it is not explicitly known how these combine to produce the final temporal meaning being expressed. In turn, the deep computational grammar delivers richer truth-conditional meaning representations of input sentences, which include a principled representation of temporal information, on which higher level tasks, including reasoning, can be based. These deep semantic representations are extended and improved according to the output of the aforementioned temporal extraction module. The prototype implemented shows performance results that increase the quality of the temporal meaning representations and are better than the performance of each of the two components in isolation.

  12. Neural signals of vicarious extinction learning.

    Science.gov (United States)

    Golkar, Armita; Haaker, Jan; Selbing, Ida; Olsson, Andreas

    2016-10-01

    Social transmission of both threat and safety is ubiquitous, but little is known about the neural circuitry underlying vicarious safety learning. This is surprising given that these processes are critical to flexibly adapt to a changeable environment. To address how the expression of previously learned fears can be modified by the transmission of social information, two conditioned stimuli (CS+s) were paired with shock and the third was not. During extinction, we held constant the amount of direct, non-reinforced, exposure to the CSs (i.e. direct extinction), and critically varied whether another individual, acting as a demonstrator, experienced safety (CS+vic safety) or aversive reinforcement (CS+vic reinf). During extinction, ventromedial prefrontal cortex (vmPFC) responses to the CS+vic reinf increased but decreased to the CS+vic safety. This pattern of vmPFC activity was reversed during a subsequent fear reinstatement test, suggesting a temporal shift in the involvement of the vmPFC. Moreover, only the CS+vic reinf association recovered. Our data suggest that vicarious extinction prevents the return of conditioned fear responses, and that this efficacy is reflected by diminished vmPFC involvement during extinction learning. The present findings may have important implications for understanding how social information influences the persistence of fear memories in individuals suffering from emotional disorders. © The Author (2016). Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  13. The Impact of Feedback on the Different Time Courses of Multisensory Temporal Recalibration

    Directory of Open Access Journals (Sweden)

    Matthew A. De Niear

    2017-01-01

    Full Text Available The capacity to rapidly adjust perceptual representations confers a fundamental advantage when confronted with a constantly changing world. Unexplored is how feedback regarding sensory judgments (top-down factors) interacts with sensory statistics (bottom-up factors) to drive long- and short-term recalibration of multisensory perceptual representations. Here, we examined the time course of both cumulative and rapid temporal perceptual recalibration for individuals completing an audiovisual simultaneity judgment task in which they were provided with varying degrees of feedback. We find that in the presence of feedback (as opposed to simple sensory exposure) temporal recalibration is more robust. Additionally, differential time courses are seen for cumulative and rapid recalibration dependent upon the nature of the feedback provided. Whereas cumulative recalibration effects relied more heavily on feedback that informs (i.e., negative feedback) rather than confirms (i.e., positive feedback) the judgment, rapid recalibration shows the opposite tendency. Furthermore, differential effects on rapid and cumulative recalibration were seen when the reliability of feedback was altered. Collectively, our findings illustrate that feedback signals promote and sustain audiovisual recalibration over the course of cumulative learning and enhance rapid trial-to-trial learning. Furthermore, given the differential effects seen for cumulative and rapid recalibration, these processes may function via distinct mechanisms.

  14. Off-Policy Reinforcement Learning: Optimal Operational Control for Two-Time-Scale Industrial Processes.

    Science.gov (United States)

    Li, Jinna; Kiumarsi, Bahare; Chai, Tianyou; Lewis, Frank L; Fan, Jialu

    2017-12-01

    Industrial flow lines are composed of unit processes operating on a fast time scale and performance measurements known as operational indices measured at a slower time scale. This paper presents a model-free optimal solution to a class of two time-scale industrial processes using off-policy reinforcement learning (RL). First, the lower-layer unit process control loop with a fast sampling period and the upper-layer operational index dynamics at a slow time scale are modeled. Second, a general optimal operational control problem is formulated to optimally prescribe the set-points for the unit industrial process. Then, a zero-sum game off-policy RL algorithm is developed to find the optimal set-points by using data measured in real-time. Finally, a simulation experiment is employed for an industrial flotation process to show the effectiveness of the proposed method.

  15. Comparing Exploration Strategies for Q-learning in Random Stochastic Mazes

    NARCIS (Netherlands)

    Tijsma, Arryon; Drugan, Madalina; Wiering, Marco

    2016-01-01

    Balancing the ratio between exploration and exploitation is an important problem in reinforcement learning. This paper evaluates four different exploration strategies combined with Q-learning using random stochastic mazes to investigate their performances. We will compare: UCB-1, softmax,
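
    For readers unfamiliar with the exploration strategies named in the abstract, the two action-selection rules below (softmax/Boltzmann and UCB-1) are minimal sketches of their general form; the temperature, exploration constant, and tie-breaking details are assumptions and need not match the parameterization used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax_action(q_values, temperature=0.5):
    # Boltzmann exploration: sample actions in proportion to exp(Q / tau).
    prefs = np.exp((q_values - q_values.max()) / temperature)
    prefs /= prefs.sum()
    return rng.choice(len(q_values), p=prefs)

def ucb1_action(q_values, counts, t, c=1.0):
    # UCB-1: prefer actions with a high value estimate or a low visit count.
    untried = np.where(counts == 0)[0]
    if untried.size:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q_values + bonus))

q = np.array([0.2, 0.5, 0.1])
counts = np.array([3, 10, 1])
print(softmax_action(q), ucb1_action(q, counts, t=14))
```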

  16. Discrete-time online learning control for a class of unknown nonaffine nonlinear systems using reinforcement learning.

    Science.gov (United States)

    Yang, Xiong; Liu, Derong; Wang, Ding; Wei, Qinglai

    2014-07-01

    In this paper, a reinforcement-learning-based direct adaptive control is developed to deliver a desired tracking performance for a class of discrete-time (DT) nonlinear systems with unknown bounded disturbances. We investigate multi-input-multi-output unknown nonaffine nonlinear DT systems and employ two neural networks (NNs). By using the Implicit Function Theorem, an action NN is used to generate the control signal and it is also designed to cancel the nonlinearity of unknown DT systems, for the purpose of utilizing feedback linearization methods. On the other hand, a critic NN is applied to estimate the cost function, which satisfies the recursive equations derived from heuristic dynamic programming. The weights of both the action NN and the critic NN are directly updated online instead of offline training. By utilizing Lyapunov's direct method, the closed-loop tracking errors and the NN estimated weights are demonstrated to be uniformly ultimately bounded. Two numerical examples are provided to show the effectiveness of the present approach. Copyright © 2014 Elsevier Ltd. All rights reserved.
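
    As a rough illustration of the heuristic dynamic programming recursion mentioned above, the sketch below regresses a critic's cost estimate toward the target U(x_k) + gamma * J(x_{k+1}) while a placeholder controller drives a toy scalar system; the linear-in-features critic, the quadratic utility, and the plant are assumptions standing in for the paper's neural networks and unknown nonaffine system.

```python
import numpy as np

def features(x):
    # Simple polynomial features standing in for a critic neural network.
    return np.array([1.0, x, x * x])

w = np.zeros(3)                      # critic weights
gamma, lr = 0.95, 0.05               # assumed discount and learning rate

def utility(x, u):
    return x * x + 0.1 * u * u       # assumed quadratic tracking cost

x = 1.0
for _ in range(2000):
    u = -0.5 * x                     # placeholder action (the action NN's job)
    x_next = 0.9 * x + 0.1 * u + 0.01 * np.random.randn()
    # HDP critic target: one-step cost plus discounted estimate at next state
    target = utility(x, u) + gamma * w @ features(x_next)
    w += lr * (target - w @ features(x)) * features(x)
    x = x_next

print(w)
```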

  17. Adversarial Reinforcement Learning in a Cyber Security Simulation

    NARCIS (Netherlands)

    Elderman, Richard; Pater, Leon; Thie, Albert; Drugan, Madalina; Wiering, Marco

    2017-01-01

    This paper focuses on cyber-security simulations in networks modeled as a Markov game with incomplete information and stochastic elements. The resulting game is an adversarial sequential decision making problem played with two agents, the attacker and defender. The two agents pit one reinforcement

  18. "The stone which the builders rejected...": Delay of reinforcement and response rate on fixed-interval and related schedules.

    Science.gov (United States)

    Wearden, J H; Lejeune, Helga

    2006-02-28

    The article deals with response rates (mainly running and peak or terminal rates) on simple and on some mixed-FI schedules and explores the idea that these rates are determined by the average delay of reinforcement for responses occurring during the response periods that the schedules generate. The effects of reinforcement delay are assumed to be mediated by a hyperbolic delay of reinforcement gradient. The account predicts that (a) running rates on simple FI schedules should increase with increasing rate of reinforcement, in a manner close to that required by Herrnstein's equation, (b) improving temporal control during acquisition should be associated with increasing running rates, (c) two-valued mixed-FI schedules with equiprobable components should produce complex results, with peak rates sometimes being higher on the longer component schedule, and (d) that effects of reinforcement probability on mixed-FI should affect the response rate at the time of the shorter component only. All these predictions were confirmed by data, although effects in some experiments remain outside the scope of the model. In general, delay of reinforcement as a determinant of response rate on FI and related schedules (rather than temporal control on such schedules) seems a useful starting point for a more thorough analysis of some neglected questions about performance on FI and related schedules.
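
    Two equations do most of the work in this account: a hyperbolic delay-of-reinforcement gradient, which discounts the reinforcing effect of food delivered some seconds after a response, and Herrnstein's hyperbola relating response rate to obtained reinforcement rate. The sketch below uses the standard textbook forms with assumed parameter values; it is not a reproduction of Wearden and Lejeune's fitted model.

```python
import numpy as np

# Illustrative forms only: a hyperbolic delay-of-reinforcement gradient and
# Herrnstein's hyperbola. Parameter values (k, re) are assumptions and are
# not taken from the article's fits.
def hyperbolic_gradient(delay, k=0.2):
    """Reinforcing effect of a reinforcer delivered `delay` seconds later."""
    return 1.0 / (1.0 + k * delay)

def herrnstein_rate(reinforcers_per_hour, k_rate=100.0, re=20.0):
    """Predicted response rate as a function of obtained reinforcement rate."""
    r = np.asarray(reinforcers_per_hour, dtype=float)
    return k_rate * r / (r + re)

# Average reinforcing effect on responses emitted during the final 30 s of a
# fixed interval (responses spaced 1 s apart before food delivery).
delays = np.arange(30.0)
print(hyperbolic_gradient(delays).mean())
print(herrnstein_rate([10, 30, 60, 120]))
```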

  19. Individual differences in spatial configuration learning predict the occurrence of intrusive memories.

    Science.gov (United States)

    Meyer, Thomas; Smeets, Tom; Giesbrecht, Timo; Quaedflieg, Conny W E M; Girardelli, Marta M; Mackay, Georgina R N; Merckelbach, Harald

    2013-03-01

    The dual-representation model of posttraumatic stress disorder (PTSD; Brewin, Gregory, Lipton, & Burgess, Psychological Review, 117, 210-232 2010) argues that intrusions occur when people fail to construct context-based representations during adverse experiences. The present study tested a specific prediction flowing from this model. In particular, we investigated whether the efficiency of temporal-lobe-based spatial configuration learning would account for individual differences in intrusive experiences and physiological reactivity in the laboratory. Participants (N = 82) completed the contextual cuing paradigm, which assesses spatial configuration learning that is believed to depend on associative encoding in the parahippocampus. They were then shown a trauma film. Afterward, startle responses were quantified during presentation of trauma reminder pictures versus unrelated neutral and emotional pictures. PTSD symptoms were recorded in the week following participation. Better configuration learning performance was associated with fewer perceptual intrusions, r = -.33, p .46) and had no direct effect on intrusion-related distress and overall PTSD symptoms, rs > -.12, ps > .29. However, configuration learning performance tended to be associated with reduced physiological responses to unrelated negative images, r = -.20, p = .07. Thus, while spatial configuration learning appears to be unrelated to affective responding to trauma reminders, our overall findings support the idea that the context-based memory system helps to reduce intrusions.

  20. Believer-Skeptic Meets Actor-Critic: Rethinking the Role of Basal Ganglia Pathways during Decision-Making and Reinforcement Learning

    Science.gov (United States)

    Dunovan, Kyle; Verstynen, Timothy

    2016-01-01

    The flexibility of behavioral control is a testament to the brain's capacity for dynamically resolving uncertainty during goal-directed actions. This ability to select actions and learn from immediate feedback is driven by the dynamics of basal ganglia (BG) pathways. A growing body of empirical evidence conflicts with the traditional view that these pathways act as independent levers for facilitating (i.e., direct pathway) or suppressing (i.e., indirect pathway) motor output, suggesting instead that they engage in a dynamic competition during action decisions that computationally captures action uncertainty. Here we discuss the utility of encoding action uncertainty as a dynamic competition between opposing control pathways and provide evidence that this simple mechanism may have powerful implications for bridging neurocomputational theories of decision making and reinforcement learning. PMID:27047328

  1. A Novel Dynamic Spectrum Access Framework Based on Reinforcement Learning for Cognitive Radio Sensor Networks

    Directory of Open Access Journals (Sweden)

    Yun Lin

    2016-10-01

    Full Text Available Cognitive radio sensor networks are one kind of application in which cognitive techniques can be adopted, and they present many potential applications, challenges and future research trends. According to the research surveys, dynamic spectrum access is an important and necessary technology for future cognitive sensor networks. Traditional methods of dynamic spectrum access are based on spectrum holes and they have some drawbacks, such as low accessibility and high interruptibility, which negatively affect the transmission performance of the sensor networks. To address this problem, in this paper a new initialization mechanism is proposed to establish a communication link and set up a sensor network without adopting spectrum holes to convey control information. Specifically, first a transmission channel model for analyzing the maximum accessible capacity for three different policies in a fading environment is discussed. Second, a hybrid spectrum access algorithm based on a reinforcement learning model is proposed for the power allocation problem of both the transmission channel and the control channel. Finally, extensive simulations have been conducted and simulation results show that this new algorithm provides a significant improvement in terms of the tradeoff between the control channel reliability and the efficiency of the transmission channel.

  2. Optimized Assistive Human-Robot Interaction Using Reinforcement Learning.

    Science.gov (United States)

    Modares, Hamidreza; Ranatunga, Isura; Lewis, Frank L; Popa, Dan O

    2016-03-01

    An intelligent human-robot interaction (HRI) system with adjustable robot behavior is presented. The proposed HRI system assists the human operator to perform a given task with minimum workload demands and optimizes the overall human-robot system performance. Motivated by human factor studies, the presented control structure consists of two control loops. First, a robot-specific neuro-adaptive controller is designed in the inner loop to make the unknown nonlinear robot behave like a prescribed robot impedance model as perceived by a human operator. In contrast to existing neural network and adaptive impedance-based control methods, no information of the task performance or the prescribed robot impedance model parameters is required in the inner loop. Then, a task-specific outer-loop controller is designed to find the optimal parameters of the prescribed robot impedance model to adjust the robot's dynamics to the operator skills and minimize the tracking error. The outer loop includes the human operator, the robot, and the task performance details. The problem of finding the optimal parameters of the prescribed robot impedance model is transformed into a linear quadratic regulator (LQR) problem which minimizes the human effort and optimizes the closed-loop behavior of the HRI system for a given task. To obviate the requirement of the knowledge of the human model, integral reinforcement learning is used to solve the given LQR problem. Simulation results on an x - y table and a robot arm, and experimental implementation results on a PR2 robot confirm the suitability of the proposed method.
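
    The outer loop described above ultimately targets the solution of a linear quadratic regulator (LQR) problem. When the dynamics are known, that solution can be computed directly from the algebraic Riccati equation, as in the sketch below for an assumed double-integrator plant and assumed weights; the paper's contribution is recovering the same optimal gain without a model of the human, via integral reinforcement learning.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Model-based LQR baseline for an assumed double-integrator plant. This only
# shows the target solution that a model-free integral RL procedure would
# converge to; A, B, Q and R are illustrative assumptions.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.diag([10.0, 1.0])   # assumed state weights (tracking error, velocity)
R = np.array([[0.1]])      # assumed effort weight

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)     # optimal feedback gain, u = -K x
print(K)
```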

  3. Performance of Reinforced Concrete Beam with Differently Positioned Replacement Zones of Block Infill under Low Impact Loads

    Directory of Open Access Journals (Sweden)

    Mokhatar Shahrul Niza

    2017-01-01

    Full Text Available This paper reveals a study performed on reinforced concrete with artificial aggregate concrete block infill composite beams to innovate a lightweight reinforced concrete utilizing polyethylene (PE) waste materials, such as waste plastic bags. Six beam specimens of normal reinforced concrete (NRC) and different block infill replacement zone positions were tested: RCAI (RZ1) beams containing 100% MAPEA with 50, 95, and 1,000 mm width, height, and length, respectively, were provided for the block infill, whereas RCAI (RZ2) beams with different block infill positions containing 100% MAPEA with 50, 115, and 1000 mm width, height, and length were provided and tested under low impact load. The steel impactor with blunt nose dropped from 0.6 m height, which is equivalent to 3.5 m/s. The behaviors of the beams were studied relative to the impact force-time and displacement-time histories, the flexural/bending cracks, and the impact failure. Results show that the overall failure modes of all the beam specimens were successfully recorded. In addition, the residual displacements of the RZ2 were almost the same as those of the RZ1 and significantly lower than those of the NRC. In reinforced concrete beams, less stressed concrete near the neutral axis can be replaced by certain lightweight materials, like waste plastic bags processed into modified artificial polyethylene aggregates, to serve as an artificial aggregate.

  4. ABOUT INFLUENCE OF DIFFERENT SCHEMES IMPACT RADIATION ENVIRONMENTS AND LOADS ON REINFORCED LAMELLAR STRUCTURAL MEMBERS

    Directory of Open Access Journals (Sweden)

    Rafail B. Garibov

    2017-12-01

    Full Text Available The article discusses a model of deformation of a fiber-reinforced concrete rectangular plate under the influence of radiation environments. In the calculation of the plate, different schemes of impact of the applied external loads and radiation environments were considered.

  5. A study on the estimation method of internal stresses caused by the difference of thermal expansion coefficients between concrete and reinforcement at elevated temperatures

    International Nuclear Information System (INIS)

    Kanazu, Tsutomu

    1998-01-01

    When a reinforced concrete member is exposed to high temperature conditions over 100°C, tensile strain occurs in the concrete and compressive strain occurs in the reinforcement due to the difference of thermal expansion coefficients between concrete and reinforcement. The mechanism is the same as that of restrained stress caused by drying shrinkage of concrete, where tensile stress occurs in the concrete because drying shrinkage strain is restrained by the reinforcement; the difference is that the phenomenon at high temperature also involves changes in the mechanical properties of concrete and reinforcement. In this study, the phenomenon is measured experimentally and clarified quantitatively. Moreover, an estimation method, derived by extending the equation for the average strain of reinforcement in the CEB Design Manual, is suggested and is verified by comparison with the experimental results. (author)

  6. Online Structural-Health Monitoring of Glass Fiber-Reinforced Thermoplastics Using Different Carbon Allotropes in the Interphase

    Directory of Open Access Journals (Sweden)

    Michael Thomas Müller

    2018-06-01

    Full Text Available An electromechanical response behavior is realized by nanostructuring the glass fiber interphase with different highly electrically conductive carbon allotropes like carbon nanotubes (CNT), graphene nanoplatelets (GNP), or conductive carbon black (CB). The operational capability of these multifunctional glass fibers for an online structural-health monitoring is demonstrated in endless glass fiber-reinforced polypropylene. The electromechanical response behavior, during a static or dynamic three-point bending test of various carbon modifications, shows qualitative differences in the signal quality and sensitivity due to the different aspect ratios of the nanoparticles and the associated electrically conductive network densities in the interphase. Depending on the embedding position within the glass fiber-reinforced composite, compression, shear and tension loadings of the fibers can be distinguished by different characteristics of the corresponding electrical signal. The occurrence of irreversible signal changes during the dynamic loading can be attributed to filler reorientation processes caused by polymer creeping or by destruction of electrically conductive paths by cracks in the glass fiber interphase.

  7. Aversive reinforcement improves visual discrimination learning in free-flying honeybees.

    Directory of Open Access Journals (Sweden)

    Aurore Avarguès-Weber

    Full Text Available BACKGROUND: Learning and perception of visual stimuli by free-flying honeybees has been shown to vary dramatically depending on the way insects are trained. Fine color discrimination is achieved when both a target and a distractor are present during training (differential conditioning), whilst if the same target is learnt in isolation (absolute conditioning), discrimination is coarse and limited to perceptually dissimilar alternatives. Another way to potentially enhance discrimination is to increase the penalty associated with the distractor. Here we studied whether coupling the distractor with a highly concentrated quinine solution improves color discrimination of both similar and dissimilar colors by free-flying honeybees. As we assumed that quinine acts as an aversive stimulus, we analyzed whether aversion, if any, is based on an aversive sensory input at the gustatory level or on a post-ingestional malaise following quinine feeding. METHODOLOGY/PRINCIPAL FINDINGS: We show that the presence of a highly concentrated quinine solution (60 mM) acts as an aversive reinforcer promoting rejection of the target associated with it, and improving discrimination of perceptually similar stimuli but not of dissimilar stimuli. Free-flying bees did not use remote cues to detect the presence of quinine solution; the aversive effect exerted by this substance was mediated via a gustatory input, i.e. via a distasteful sensory experience, rather than via a post-ingestional malaise. CONCLUSION: The present study supports the hypothesis that aversion conditioning is important for understanding how and what animals perceive and learn. By using this form of conditioning coupled with appetitive conditioning in the framework of a differential conditioning procedure, it is possible to uncover discrimination capabilities that may remain otherwise unsuspected. We show, therefore, that visual discrimination is not an absolute phenomenon but can be modulated by experience.

  8. Understanding Interorganizational Learning Based on Social Spaces and Learning Episodes

    Directory of Open Access Journals (Sweden)

    Anelise Rebelato Mozzato

    2014-07-01

    Full Text Available Different organizational settings have been gaining ground in the world economy, resulting in a proliferation of different forms of strategic alliances that translate into a growth in the number of organizations that have started to deal with interorganizational relationships with different actors. These circumstances reinforce Crossan, Lane, White and Djurfeldt (1995) and Crossan, Mauer and White (2011) in exploring what authors refer to as the fourth, interorganizational, level of learning. These authors, amongst others, suggest that the process of interorganizational learning (IOL) warrants investigation, as its scope of analysis needs widening and deepening. Therefore, this theoretical essay is an attempt to understand IOL as a dynamic process found in interorganizational cooperative relationships that can take place in different structured and unstructured social spaces and that can generate learning episodes. According to this view, IOL is understood as part of an organizational learning continuum and is analyzed within the framework of practical rationality in an approach that is less cognitive and more social-behavioral.

  9. Investigation of different carbon nanotube reinforcements for fabricating bulk AlMg5 matrix nanocomposites

    Energy Technology Data Exchange (ETDEWEB)

    Kallip, Kaspar, E-mail: kaspar.kallip@empa.ch [Empa, Swiss Federal Laboratories for Material Science and Technology, Laboratory for Advanced Materials Processing, Feuerwerkerstrasse 39, CH-3602 Thun (Switzerland); Leparoux, Marc [Empa, Swiss Federal Laboratories for Material Science and Technology, Laboratory for Advanced Materials Processing, Feuerwerkerstrasse 39, CH-3602 Thun (Switzerland); AlOgab, Khaled A. [King Abdulaziz City for Science and Technology (KACST), National Centers for Advanced Materials, P O Box 6086, Riyadh, 11442 (Saudi Arabia); Clerc, Steve; Deguilhem, Guillaume [Empa, Swiss Federal Laboratories for Material Science and Technology, Laboratory for Advanced Materials Processing, Feuerwerkerstrasse 39, CH-3602 Thun (Switzerland); Arroyo, Yadira [Empa, Swiss Federal Laboratories for Material Science and Technology, Electron Microscopy Center, Ueberlandstrasse 129, CH-8600 Dübendorf (Switzerland); Kwon, Hansang [Empa, Swiss Federal Laboratories for Material Science and Technology, Laboratory for Advanced Materials Processing, Feuerwerkerstrasse 39, CH-3602 Thun (Switzerland); Pukyong National University, Department of Materials System Engineering, 365 Sinseon-ro, Busan 608-739 (Korea, Republic of)

    2015-10-15

    AlMg5-based metal matrix composites were successfully fabricated using high energy planetary ball-milling and hot pressing. The influence of 6 types of carbon nanotubes (CNTs) with different properties was investigated for reinforcement. An over 3-fold increase in hardness and ultimate tensile strength was achieved, with maximum values of 200 HV20 and 720 MPa respectively, by varying the CNT content from 0.5 to 5 vol%. The state, the dispersion as well as the reactivity of the different CNTs were investigated by Raman spectroscopy, X-ray diffraction and microscopy. The CNTs were considered to be dispersed homogeneously, but were shortened due to high energy milling. No significant differences in mechanical performance could be observed depending either on the nature or on the initial agglomeration state of the investigated CNTs. The milling time has, however, to be adjusted to the CNT content, as higher concentrations require a longer milling time for achieving dispersion of the nano-reinforcement. - Highlights: • CNTs sustained the milling process and became homogeneously dispersed. • 3 times strengthening over unreinforced alloy achieved. • Flexible processing route for dispersing wide range of nanoparticulate materials.

  10. When Learning Disturbs Memory – Temporal Profile of Retroactive Interference of Learning on Memory Formation

    Science.gov (United States)

    Sosic-Vasic, Zrinka; Hille, Katrin; Kröner, Julia; Spitzer, Manfred; Kornmeier, Jürgen

    2018-01-01

    Introduction: Consolidation is defined as the time necessary for memory stabilization after learning. In the present study we focused on effects of interference during the first 12 consolidation minutes after learning. Participants had to learn a set of German – Japanese word pairs in an initial learning task and a different set of German – Japanese word pairs in a subsequent interference task. The interference task started in different experimental conditions at different time points (0, 3, 6, and 9 min) after the learning task and was followed by subsequent cued recall tests. In a control experiment the interference periods were replaced by rest periods without any interference. Results: The interference task decreased memory performance by up to 20%, with negative effects at all interference time points and large variability between participants concerning both the time point and the size of maximal interference. Further, fast learners seem to be more affected by interference than slow learners. Discussion: Our results indicate that the first 12 min after learning are highly important for memory consolidation, without a general pattern concerning the precise time point of maximal interference across individuals. This finding raises doubts about the generalized learning recipes and calls for individuality of learning schedules. PMID:29503621

  11. When Learning Disturbs Memory – Temporal Profile of Retroactive Interference of Learning on Memory Formation

    Directory of Open Access Journals (Sweden)

    Zrinka Sosic-Vasic

    2018-02-01

    Full Text Available Introduction: Consolidation is defined as the time necessary for memory stabilization after learning. In the present study we focused on effects of interference during the first 12 consolidation minutes after learning. Participants had to learn a set of German – Japanese word pairs in an initial learning task and a different set of German – Japanese word pairs in a subsequent interference task. The interference task started in different experimental conditions at different time points (0, 3, 6, and 9 min) after the learning task and was followed by subsequent cued recall tests. In a control experiment the interference periods were replaced by rest periods without any interference. Results: The interference task decreased memory performance by up to 20%, with negative effects at all interference time points and large variability between participants concerning both the time point and the size of maximal interference. Further, fast learners seem to be more affected by interference than slow learners. Discussion: Our results indicate that the first 12 min after learning are highly important for memory consolidation, without a general pattern concerning the precise time point of maximal interference across individuals. This finding raises doubts about the generalized learning recipes and calls for individuality of learning schedules.

  12. Adults Learn in a Different Way

    Directory of Open Access Journals (Sweden)

    Ema Perme

    1996-12-01

    Full Text Available In response to demands from practice, a new programme in the field of adult education has been created. The advisers at the Job Centre in Maribor have established that a great number of unemployed people take part in different educational programmes to become more competitive on the labour market, yet their motivation for further learning/education is very low. The fear they experience can also be connected with a lack of knowledge of different learning techniques. 'Adults Learn in a Different Way' is a programme designed to help those with motivation problems and/or problems with using appropriate learning techniques. During the 16-hour programme participants work on the following topics: the ways adults learn, the significance of different learning types, the importance of music for more successful learning, strategies for making a learning plan, learning techniques with an emphasis on mindmapping, how to define concrete learning goals, and how to reach goals in line with one's own personal significance and abilities. Seven experimental realisations in the past year showed some very encouraging results. With the help of anonymous questionnaires and personal talks with participants 6 months after they had attended the programme, we obtained the first feedback. All the participants find the programme useful and its content helpful for making their own learning plans and strategies. They are able to concentrate better, they are able to reach their learning goals step by step as planned, and they would all recommend the programme to their friends and acquaintances.

  13. Multiple reversal olfactory learning in honeybees

    Directory of Open Access Journals (Sweden)

    Theo Mota

    2010-07-01

    Full Text Available In multiple reversal learning, animals trained to discriminate a reinforced from a non-reinforced stimulus are subjected to various, successive reversals of stimulus contingencies (e.g. A+ vs. B-, A- vs. B+, A+ vs. B-). This protocol is useful to determine whether or not animals learn to learn and solve successive discriminations faster (or with fewer errors) with increasing reversal experience. Here we used the olfactory conditioning of proboscis extension reflex to study how honeybees Apis mellifera perform in a multiple reversal task. Our experiment contemplated four consecutive differential conditioning phases involving the same odors (A+ vs. B- to A- vs. B+ to A+ vs. B- to A- vs. B+). We show that bees in which the weight of reinforced or non-reinforced stimuli was similar mastered the multiple olfactory reversals. Bees which failed the task exhibited asymmetric responses to reinforced and non-reinforced stimuli, thus being unable to rapidly reverse stimulus contingencies. Efficient reversers did not improve their successive discriminations but rather tended to generalize their choice to both odors at the end of conditioning. As a consequence, both discrimination and reversal efficiency decreased along experimental phases. This result invalidates a learning-to-learn effect and indicates that bees do not only respond to the actual stimulus contingencies but rather combine these with an average of past experiences with the same stimuli.

  14. Machine learning approach for the outcome prediction of temporal lobe epilepsy surgery.

    Directory of Open Access Journals (Sweden)

    Rubén Armañanzas

    Full Text Available Epilepsy surgery is effective in reducing both the number and frequency of seizures, particularly in temporal lobe epilepsy (TLE). Nevertheless, a significant proportion of these patients continue suffering seizures after surgery. Here we used a machine learning approach to predict the outcome of epilepsy surgery based on supervised classification data mining taking into account not only the common clinical variables, but also pathological and neuropsychological evaluations. We have generated models capable of predicting whether a patient with TLE secondary to hippocampal sclerosis will fully recover from epilepsy or not. The machine learning analysis revealed that outcome could be predicted with an estimated accuracy of almost 90% using some clinical and neuropsychological features. Importantly, not all the features were needed to perform the prediction; some of them proved to be irrelevant to the prognosis. Personality style was found to be one of the key features to predict the outcome. Although we examined relatively few cases, findings were verified across all data, showing that the machine learning approach described in the present study may be a powerful method. Since neuropsychological assessment of epileptic patients is a standard protocol in the pre-surgical evaluation, we propose to include these specific psychological tests and machine learning tools to improve the selection of candidates for epilepsy surgery.
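
    As a schematic of the supervised-classification workflow the abstract describes (features in, dichotomous surgical outcome out, accuracy estimated by cross-validation), the snippet below runs the same kind of pipeline on synthetic data. The feature set, the random-forest classifier, and the resulting score are illustrative assumptions and are unrelated to the study's actual models, variables, or reported accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Supervised classification on synthetic data; everything here is invented
# for illustration, not taken from the cited study.
rng = np.random.default_rng(42)
n = 60
X = np.column_stack([
    rng.normal(size=n),            # e.g., a neuropsychological test score
    rng.normal(size=n),            # e.g., a personality-style scale
    rng.integers(0, 2, size=n),    # e.g., a binary clinical indicator
])
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())   # estimated accuracy
```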

  15. Reinforcement Learning Multi-Agent Modeling of Decision-Making Agents for the Study of Transboundary Surface Water Conflicts with Application to the Syr Darya River Basin

    Science.gov (United States)

    Riegels, N.; Siegfried, T.; Pereira Cardenal, S. J.; Jensen, R. A.; Bauer-Gottwein, P.

    2008-12-01

    In most economics-driven approaches to optimizing water use at the river basin scale, the system is modelled deterministically with the goal of maximizing overall benefits. However, actual operation and allocation decisions must be made under hydrologic and economic uncertainty. In addition, river basins often cross political boundaries, and different states may not be motivated to cooperate so as to maximize basin-scale benefits. Even within states, competing agents such as irrigation districts, municipal water agencies, and large industrial users may not have incentives to cooperate to realize efficiency gains identified in basin-level studies. More traditional simulation-optimization approaches assume pre-commitment by individual agents and stakeholders and unconditional compliance on each side. While this can help determine attainable gains and tradeoffs from efficient management, such hardwired policies do not account for dynamic feedback between agents themselves or between agents and their environments (e.g. due to climate change etc.). In reality however, we are dealing with an out-of-equilibrium multi-agent system, where there is neither global knowledge nor global control, but rather continuous strategic interaction between decision making agents. Based on the theory of stochastic games, we present a computational framework that allows for studying the dynamic feedback between decision-making agents themselves and an inherently uncertain environment in a spatially and temporally distributed manner. Agents with decision-making control over water allocation such as countries, irrigation districts, and municipalities are represented by reinforcement learning agents and coupled to a detailed hydrologic-economic model. This approach emphasizes learning by agents from their continuous interaction with other agents and the environment. It provides a convenient framework for the solution of the problem of dynamic decision-making in a mixed cooperative / non
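
    To make the multi-agent framing concrete, the toy sketch below has two independent reinforcement-learning agents repeatedly choosing withdrawal levels from a shared flow, each updating its own action values from its own reward. The payoff function, action set, and stateless bandit-style update are invented for illustration and bear no relation to the coupled hydrologic-economic model of the Syr Darya basin described above.

```python
import numpy as np

# Two independent learners sharing a resource; everything here is a toy
# stand-in for the abstract's detailed hydrologic-economic simulation.
rng = np.random.default_rng(7)
actions = [0, 1, 2]                  # withdrawal levels
Q = [np.zeros(len(actions)), np.zeros(len(actions))]
alpha, eps, flow = 0.1, 0.2, 3.0

def reward(own, other):
    # Benefit of own withdrawal minus a penalty when joint demand exceeds flow.
    shortage = max(0.0, own + other - flow)
    return own - 2.0 * shortage

for _ in range(5000):
    picks = [int(rng.integers(len(actions))) if rng.random() < eps
             else int(np.argmax(q)) for q in Q]
    for i in range(2):
        r = reward(actions[picks[i]], actions[picks[1 - i]])
        Q[i][picks[i]] += alpha * (r - Q[i][picks[i]])

print([int(np.argmax(q)) for q in Q])   # each agent's learned withdrawal level
```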

  16. Experimental analysis of reinforced concrete beams strengthened in bending with carbon fiber reinforced polymer

    Directory of Open Access Journals (Sweden)

    M. M. VIEIRA

    Full Text Available Carbon fiber reinforced polymer (CFRP) has been widely used for the reinforcement of concrete structures due to its practicality and versatility in application, low weight, high tensile strength and corrosion resistance. Some construction companies use CFRP in flexural strengthening of reinforced concrete beams, but without anchor systems. Therefore, the aim of this study is to analyze, through an experimental program, the structural behavior of reinforced concrete beams flexurally strengthened with CFRP without anchor fibers, varying the steel reinforcement and the number of carbon fiber reinforcement layers. Thus, two groups of reinforced concrete beams were produced with the same geometry but with different steel reinforcement. Each group had five beams: one not reinforced with CFRP (reference) and the others reinforced with two, three, four and five layers of carbon fibers. Beams were designed using a computational routine developed in MAPLE software and subsequently tested in four-point flexural tests up to collapse. The experimental tests confirmed the effectiveness of the reinforcement, showing that beams collapsed at higher loads and lower deformations as the amount of fibers in the reinforcing layers increased. However, the increase in the number of layers did not provide a significant increase in the performance of the strengthened beams, indicating that it was not possible to take full advantage of the applied strengthening due to the occurrence of a premature failure mode in the strengthened beams, pullout of the cover, that could have been avoided through the use of a suitable anchoring system for the CFRP.

  17. Instant transformation of learned repulsion into motivational "wanting".

    Science.gov (United States)

    Robinson, Mike J F; Berridge, Kent C

    2013-02-18

    Learned cues for pleasant reward often elicit desire, which, in addicts, may become compulsive. According to the dominant view in addiction neuroscience and reinforcement modeling, such desires are the simple products of learning, coming from a past association with reward outcome. We demonstrate that cravings are more than merely the products of accumulated pleasure memories-even a repulsive learned cue for unpleasantness can become suddenly desired via the activation of mesocorticolimbic circuitry. Rats learned repulsion toward a Pavlovian cue (a briefly-inserted metal lever) that always predicted an unpleasant Dead Sea saltiness sensation. Yet, upon first reencounter in a novel sodium-depletion state to promote mesocorticolimbic reactivity (reflected by elevated Fos activation in ventral tegmentum, nucleus accumbens, ventral pallidum, and the orbitofrontal prefrontal cortex), the learned cue was instantly transformed into an attractive and powerful motivational magnet. Rats jumped and gnawed on the suddenly attractive Pavlovian lever cue, despite never having tasted intense saltiness as anything other than disgusting. Instant desire transformation of a learned cue contradicts views that Pavlovian desires are essentially based on previously learned values (e.g., prediction error or temporal difference models). Instead desire is recomputed at reencounter by integrating Pavlovian information with the current brain/physiological state. This powerful brain transformation reverses strong learned revulsion into avid attraction. When applied to addiction, related mesocorticolimbic transformations (e.g., drugs or neural sensitization) of cues for already-pleasant drug experiences could create even more intense cravings. This cue/state transformation helps define what it means to say that addiction hijacks brain limbic circuits of natural reward. Copyright © 2013 Elsevier Ltd. All rights reserved.
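
    The "learned value" models the authors argue against can be summarized by a cached temporal-difference update: the cue's worth is whatever past outcomes trained into it, with no recomputation from the animal's current physiological state. The minimal TD(0) sketch below illustrates that class of model (state names, reward value, and parameters are assumptions for illustration); under such a rule the salt-paired lever cue can only carry the negative value it acquired, which is exactly what the reported instant transformation contradicts.

```python
# Minimal TD(0) sketch of the "cached value" class of models the authors
# argue against: value comes only from past outcomes, with no on-the-fly
# recomputation from the current physiological state.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    delta = r + gamma * V[s_next] - V[s]   # temporal-difference error
    V[s] = V[s] + alpha * delta
    return V, delta

V = {"lever_cue": 0.0, "salt_taste": 0.0, "terminal": 0.0}
# Repeated pairings of the lever cue with the aversive salt outcome (r = -1)
for _ in range(50):
    V, _ = td0_update(V, "lever_cue", 0.0, "salt_taste")
    V, _ = td0_update(V, "salt_taste", -1.0, "terminal")

print(V["lever_cue"])   # strongly negative: the cached value predicts avoidance
```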

  18. [Effects of prefrontal ablations on active choice between feeders under different reinforcement probability and value in dogs].

    Science.gov (United States)

    Preobrazhenskaia, L A; Ioffe, M E; Mats, V N

    2004-01-01

    The role of the prefrontal cortex in the active choice between two feeders was investigated under changes in the value and probability of reinforcement. The experiments were performed on two dogs with prefrontal ablations (g. proreus). Before the lesions, the dogs were trained to obtain food from two different feeders in response to conditioned stimuli with equally probable alimentary reinforcement. After the ablation, the dogs ran from one feeder to the other during the inter-trial intervals, and in response to the conditioned stimuli they repeatedly chose the same feeder. This disturbance of behavior recovered completely after some time. In experiments that set reinforcement probability against reinforcement value, the dogs chose the feeder with the lower probability but better quality of reinforcement. In experiments with equal value but different probability, intact dogs chose the feeder with the higher probability, whereas the dogs with prefrontal lesions chose each feeder equiprobably. Thus, under conditions of free behavior, one of the functions of the prefrontal cortex is the selection of reactions associated with the higher probability of reinforcement.
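
    The "probability versus value" competition described above amounts to an expected-value comparison; a minimal sketch with assumed numbers (not the study's parameters):

        # Hypothetical feeder parameters for the competition condition (assumed).
        feeders = {
            "A": {"p": 0.8, "value": 1.0},   # higher probability, ordinary reinforcement
            "B": {"p": 0.3, "value": 3.0},   # lower probability, better-quality reinforcement
        }
        expected_value = {name: f["p"] * f["value"] for name, f in feeders.items()}
        choice = max(expected_value, key=expected_value.get)
        print(expected_value, "->", choice)   # {'A': 0.8, 'B': 0.9} -> B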

  19. Behavior of reinforced concrete beams reinforced with GFRP bars

    Directory of Open Access Journals (Sweden)

    D. H. Tavares

    Full Text Available The use of fiber reinforced polymer (FRP) bars is one of the alternatives presented in recent studies to avoid the drawbacks related to steel reinforcement in specific reinforced concrete members. In this work, six reinforced concrete beams were submitted to four-point bending tests. One beam was reinforced with CA-50 steel bars and five with glass fiber reinforced polymer (GFRP) bars. The tests were carried out in the Department of Structural Engineering of the São Carlos School of Engineering, University of São Paulo. The objective of the test program was to compare strength, reinforcement deformation, displacement, and some anchorage aspects between the GFRP-reinforced concrete beams and the steel-reinforced concrete beam. The results show that, even though four GFRP-reinforced concrete beams were designed with the same internal tension force as the beam with steel reinforcement, their capacity was lower than that of the steel-reinforced beam. The results also show that similar flexural capacity can be achieved for the steel- and GFRP-reinforced concrete beams by controlling the stiffness (reinforcement modulus of elasticity multiplied by the bar cross-sectional area, EA) and the tension force of the GFRP bars.
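
    A minimal sketch (assumed material properties and bar sizes, not the paper's data) of the stiffness-matching idea in the last sentence: choosing a GFRP bar area so that its axial stiffness EA equals that of a reference steel layout:

        import math

        # Assumed elastic moduli (typical order-of-magnitude values, not the paper's data).
        E_STEEL = 200e9   # Pa, CA-50 steel reinforcement
        E_GFRP = 45e9     # Pa, GFRP bars

        def bar_area(diameter_m):
            return math.pi * diameter_m ** 2 / 4

        # Hypothetical reference steel layout: two 12.5 mm bars.
        A_steel = 2 * bar_area(0.0125)
        EA_target = E_STEEL * A_steel

        # GFRP area (and equivalent bar count) needed for the same axial stiffness EA.
        A_gfrp = EA_target / E_GFRP
        n_gfrp_bars = A_gfrp / bar_area(0.0125)

        print(f"EA target        : {EA_target / 1e6:.1f} MN")
        print(f"GFRP area needed : {A_gfrp * 1e6:.0f} mm^2 (~{n_gfrp_bars:.1f} bars of 12.5 mm)")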

  20. Skill Learning for Intelligent Robot by Perception-Action Integration: A View from Hierarchical Temporal Memory

    Directory of Open Access Journals (Sweden)

    Xinzheng Zhang

    2017-01-01

    Full Text Available Learning skills autonomously through interactions with the environment is a crucial ability for an intelligent robot. Perception-action integration, or the sensorimotor cycle, is an important issue in imitation learning and provides a natural mechanism that avoids complex hand-programming. Recently, neurocomputing models and developmental intelligence methods have been considered a promising direction for implementing robot skill learning. In this paper, based on research on models of the human neocortex, we present a skill learning method that uses a perception-action integration strategy from the perspective of hierarchical temporal memory (HTM) theory. Sequential sensor data representing a certain skill are received from an RGB-D camera and encoded as a sequence of sparse distributed representation (SDR) vectors. The sequential SDR vectors are treated as the inputs of the perception-action HTM. The HTM learns sequences of SDRs and predicts what the next input SDR will be, storing the transitions from the currently perceived sensor data to the next predicted actions. We evaluated the performance of the proposed framework by learning a hand-shaking skill on a humanoid NAO robot. The experimental results show that the skill learning method designed in this paper is promising.
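
    As a highly simplified sketch of the pipeline described above (not the HTM/NuPIC implementation; the encoder, SDR sizes and hand-shaking frame ids are hypothetical), sensor frames can be encoded as SDRs and a first-order transition memory can predict the next SDR in a learned skill sequence; real HTM temporal memory additionally keeps higher-order context per column:

        import numpy as np

        N_BITS, N_ACTIVE = 1024, 20        # SDR width and number of active bits (assumed)

        def encode(frame_id):
            """Hypothetical encoder: map a discretized sensor frame to a fixed random SDR."""
            local = np.random.default_rng(frame_id)       # deterministic per frame id
            sdr = np.zeros(N_BITS, dtype=bool)
            sdr[local.choice(N_BITS, N_ACTIVE, replace=False)] = True
            return sdr

        transitions = {}                   # active-bit set of current SDR -> next SDR

        def learn_sequence(frame_ids):
            sdrs = [encode(f) for f in frame_ids]
            for cur, nxt in zip(sdrs, sdrs[1:]):
                transitions[frozenset(np.flatnonzero(cur))] = nxt

        def predict_next(frame_id):
            key = frozenset(np.flatnonzero(encode(frame_id)))
            return transitions.get(key)    # None if this transition was never learned

        # Usage: learn a short hand-shaking frame sequence, then predict from frame 2.
        learn_sequence([1, 2, 3, 4])
        nxt = predict_next(2)
        overlap = int(np.sum(nxt & encode(3))) if nxt is not None else 0
        print("overlap of prediction with frame 3:", overlap)   # 20 = perfect match here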