Sam's Technical Blog: Sen et al (2007) Learning to Identify Beneficial Partners

Cited by 4 [ATGSATOP]

So this is another paper in my attempt to finish the background reading for an invited paper in the AP2PC'07 workshop proceedings. I believe I found this one following a citation trail from Ben-Ami and Shehory (2007) and I think I grabbed it because it had "learning" in the title. Peer to Peer is mentioned in passing, but this paper is really about a multi-agent system where individual agents have learning capabilities. I know the first author from a panel session in AP2PC'05, so that is another connection, but I can't really remember if I had some more complex motivation for printing out this particular paper last October.

In principle I am reading this to help illuminate some of the ways that agent research can be of benefit to P2P researchers, but there is a part of me that is just interested in mathematical and algorithmic characterizations of "learning". The paper itself introduces parallels between human and artificial agents trying to make critical choices about interaction partners; and this makes me think of the human interaction analogies in Ian Clarkes Masters thesis on Freenet, my own intuitions about pruning search in my NeuroGrid system, as well as the agent modelling in the paper I co-authored with Ben Tse and Raman Paranjape. We are all humans and we interact with other humans most days, and so I guess it is no surprise that this sort of analogy crops up again and again; however I think there is a pitfall here. Sometimes the analogies breakdown and our intuitions lead us astray - I think this is the case with mobile agents where our human experience of the greater efficiency of human face to face interaction suggests that sending a mobile agent across a network should be more efficient than static agents communicating with each other when in fact it is difficult to predict the relative efficiency of the two methods for mobile agents (Joseph & Kawamura, 2001).

The goal of the authors research is to try and discover which learning schemes will sustain mutually beneficial partnerships between agents. Apparently algorithms which achieve equilibria in repeated play have so far been restricted to two-player situations. This paper examines a population of agents that learn through the reinforcement technique of Q-learning (Watkins & Dayan, 1992). The authors restrict their system to one where agents search through repetitive personal interaction; not through referral.

In the authors system each agent is of a particular type, and has preferences to interact with agents of other types. Thus the potential reward that agents achieve through interacting with each other is a matrix of agents against types; and the matrix is designed such that some optimal solution of agent partnerships exists where no agent can get a greater reward by switching to interact with other agents. Since the matrix of rewards is unknown to the agents, the Q-learning technique is used to update agents estimates of the rewards of interacting with each other. Q-learning updates estimates through a combination of earlier experiences of reward with the current experience. The extent to which experience influences current estimates is not varied, and the alpha parameter that determines this is not mentioned again in the paper, making a replication difficult. However, in order to vary the agents level of explorative behaviour, i.e. the extent to which agents try out new interactions, the authors adjust the probability with which the agents select a random agent to interact with, rather than the one recommended by the Q-learning estimate. By adjusting this probability over time exploratory behaviour is gradually reduced in what seems like a sort of simulated annealing.

No particular justification is given for the particular parameter settings or the approach used, leading me to wonder what the basis for this approach is. Are these techniques similar to others in the literature, or are they based on any empirical observations of real-world phenomena? Nonetheless, initial simulation results in a static environment show that a slow decay of exploratory behaviour is associated with the system taking longer to achieve equilibrium, but also with a higher final average payoff for the agents. This certainly makes intuitive sense.

In subsequent simulations dynamic environments are explored where agents die off and are replaced if they fail to achieve a sufficiently high payoff within a certain timeframe. As the environment becomes tougher and agents are killed off more quickly it takes longer and longer for the system to reach a stable equilibrium, although this can be mitigated by reducing the level of exploratory behaviour (see figure d is rate at which exploratory behaviour decays). Again this makes intuitive sense.

In further simulations we see that protecting young agents can also help the system achieve equilibrium sooner, which also makes intuitive sense, and makes me think of the use of karma in online communities; or at least the way that new users will be given an initial chunk of reward or karma points. Not sure how strong the parallel is here, but I guess you could model an online community in terms of multi-agents looking for beneficial interactions. New users are entering the community at a certain rate, and not hanging around indefinitely. They will need to have positive interactions within a certain time period before they will effectively remove themselves from the community; which makes me think of that paper that shows the effect of existing social network patterns on incoming users (wasn't it something to do with closed triangular relations) - should re-read that for my thesis project, if I can find it (probably in disCourse somewhere).

In a final section the authors experiment with introducing noise and we see that noise can have a similar effect to prolonging exploratory behaviour, i.e. taking longer to get to equilibrium, but perhaps finding a higher optimum. My main concern with all this is the relationship to the real world where systems may spend much of their time away form equilibrium. I see connections to other work that I have done on the evolution of intelligence (where we compared our models with populations of animals in the real world) and online communities, but a lot of the modeling decisions seem to be soewhat arbitrary. It would be nice to know what was motivating them.

My references:

Joseph S. & Kawamura T. (2001) Why Autonomy Makes the Agent. In Agent Engineering, Eds. Liu, J, Zhong, N, Tang, Y.Y. and Wang P. World, Scientific Publishing.

Sandip Sen, Anil Gursel, & Stephane Airiau (2007). Learning to identify beneficial partners Working Notes of the Adaptive and Learning Agents Workshop at AAMAS

Thursday, July 2, 2009

Sen et al (2007) Learning to Identify Beneficial Partners

Sam's Technical Blog

Followers

Sam's Academic Research Papers

Blog Archive

About Me