
Thread: Creep Block Test Scenario for Episodic Reinforcement Learning

  1. #1

    Dota 2 Reinforcement Learning - Creep Block

    About
    The challenge of creating a bot for Dota 2 lies in the vast amount of information available at every frame and the continuous space of possible actions.

    I am currently investigating what sort of model can effectively tackle this problem. I have tinkered with all sorts of models, and my most successful one so far, shown below, is a continuous policy model.

    I constrain myself to a creep block scenario, as it is a simple enough test of whether my model is capable of learning to extract the appropriate features from the data and map them to an optimal policy. The objective is to block the creeps as much as possible. Every 5 episodes, I use a hardcoded bot to "bootstrap" the training in an effort to get the model to learn faster.

    Video



    Model
    My model consists of an online policy network and a target policy network.

    The networks take as input the (x, y) offsets of the creeps relative to the hero as the state.
    The networks output 20 2D normal distributions, from which I sample the (x, y) offset the hero should move to. Actions are performed every 0.2 s.

    The Dota 2 addon (i.e. the bot in the game) runs the target policy network and gathers the states, actions, and rewards. This experience is then passed to a web service integrated with TensorFlow to train my online policy network.

    The TensorFlow component consists of the online policy network and an online value network. The experience is used to train these two networks.

    Every 10 episodes, the target policy network is replaced with the online policy network.
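
    For anyone who wants a concrete picture, here is a minimal sketch of the setup described above. It is not the code from the repository: I assume a 4-creep state, a single 2D Gaussian action head (the real model outputs 20 distributions), tf.keras, and placeholder layer sizes. The value network and the actual training update are omitted, and the scripted "bootstrap" every 5 episodes is only stubbed out.

    Code:
    import numpy as np
    import tensorflow as tf

    N_CREEPS = 4                  # assumed number of creeps tracked in the state
    STATE_DIM = 2 * N_CREEPS      # one (x, y) offset per creep

    def build_policy():
        """Maps creep (x, y) offsets to the mean/log-std of a 2D movement offset."""
        inp = tf.keras.Input(shape=(STATE_DIM,))
        h = tf.keras.layers.Dense(64, activation="relu")(inp)
        h = tf.keras.layers.Dense(64, activation="relu")(h)
        mean = tf.keras.layers.Dense(2)(h)       # mean (dx, dy)
        log_std = tf.keras.layers.Dense(2)(h)    # per-dimension log std
        return tf.keras.Model(inp, [mean, log_std])

    online_policy = build_policy()
    target_policy = build_policy()
    target_policy.set_weights(online_policy.get_weights())

    def sample_action(policy, state):
        """Sample an (x, y) movement offset from the policy's Gaussian (one action per 0.2 s tick)."""
        mean, log_std = policy(state[None, :])
        noise = tf.random.normal(tf.shape(mean))
        return (mean + tf.exp(log_std) * noise)[0].numpy()

    def pick_action(episode, state, scripted_bot_action):
        """Every 5th episode use the hardcoded bot's action to bootstrap learning."""
        if episode % 5 == 0:
            return scripted_bot_action(state)
        return sample_action(target_policy, state)

    def sync_target():
        """Every 10 episodes the target network is refreshed from the online one."""
        target_policy.set_weights(online_policy.get_weights())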

    Link to Code
    https://github.com/BeyondGodlikeBot/CreepBlockAI

    Edit: Replaced original post.
    Last edited by BeyondGodlikeBot; 08-24-2017 at 06:18 AM.

  2. #2
    It's a good start honestly, but aren't you only going to train it to creep block under the conditions of no harassment, no friendlies in the way, and that one starting position?

  3. #3
    Yes to all those questions. You've got to realize that implementing even state-of-the-art reinforcement learning to solve this problem isn't a trivial task.

    Even if I further constrain my simulator to spawn the same wave each time, I still need my model to learn an optimal sequence of actions over a 17^80 state space. Multiply this by the number of possible variations, and you can see how huge a problem this is*. This is all already under lots of assumptions to make it easier for myself, i.e. discretizing a continuous Markov decision process and discretizing a continuous action space.

    *As a side note this is why 5v5 is laughably unsolvable at the moment. Each dimension on top is at minimum a multiplicative factor which blows the state space to near infinity.
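
    As a rough back-of-the-envelope estimate of that figure: 17^80 = 10^(80 * log10 17) ≈ 10^98, i.e. a state space far beyond anything that could ever be enumerated, which is exactly why the policy has to generalize rather than memorize.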

    Also, if you've ever done reinforcement learning research, you'll quickly find that the biggest spanner you can throw in the works is requiring your model to handle a variable number of inputs/outputs. Long story short, when you are training a model, you are effectively learning a set of parameters which will give the best outputs (based on your loss function, which in RL is reward-based). You cannot just add a new input, because that input means new parameters which have not been optimized. Techniques for handling a variable number of inputs are just another layer of unnecessary complexity.
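
    To make the fixed-parameter point concrete, here is a tiny illustration (tf.keras, with made-up sizes): the first layer's weight matrix is shaped by the input dimension, so a state with a different number of creeps simply does not fit the parameters that were trained.

    Code:
    import tensorflow as tf

    # A Dense layer's weight matrix has shape (input_dim, units),
    # so its parameters are tied to one fixed input size.
    layer = tf.keras.layers.Dense(64)
    layer.build(input_shape=(None, 8))   # state for 4 creeps -> weights of shape (8, 64)
    print(layer.kernel.shape)            # (8, 64)

    # A state for 5 creeps has 10 numbers: feeding it raises a shape error,
    # and rebuilding the layer for 10 inputs creates new, untrained weights.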

    Further to all this, there is no guarantee that whatever model you test is going to be right from the very start. Solving these sorts of problems usually requires hyperparameter tuning, which takes a long time.

    I am sure OpenAI also gave themselves a huge range of handy handicaps to make the problem actually solvable with the RL techniques we currently know. They also have the added benefit of having clusters available to them. They can probably run a large number of Dota 2 instances in parallel, which greatly speeds up the learning.
    Last edited by BeyondGodlikeBot; 08-17-2017 at 12:39 AM.

  4. #4
    Right, I follow you, which makes me ask the question - why do it that way?

    Why have the bot "learn" how to creep block when it is much easier to just code it? I would think "learning" the "when to creep block" is much more important than "how to creep block". I'm not deeply engaged in Reinforcement Learning or Machine Learning currently, but have a passing interest/curiosity in it. So I apologize if I spew garbage from my mouth, but why is everyone and their mother so hell-bent on having the bot(s) learn everything from scratch and then giving up and saying "too difficult with current hardware / processing limitations"?

    I would think "learning" over a layered Dota 2 win-plan that can be deconstructed into any number of deeper layers would be so much easier and way better. What I mean by this (remember, I'm not versed in ML or RL, so my verbiage and use of certain AI keywords might be incorrect, but hopefully I get the meaning across) is:

    Global ProbabilityOfWin = 0.0%

    Objective_1: Win Dota 2 Game
    Requirements: Enemy Ancient Destroyed
    Effect: IfCompleted: ProbabilityOfWin = 100%, IfNotCompleted: None
    Action: None

    Objective_2: Destroy Enemy Ancient
    Requirements: Enemy T4 Tower #1 Destroyed, Enemy T4 Tower #2 Destroyed, Friendly Hero Near Ancient, Enemy Ancient Not Glyphed
    Effect: IfCompleted: ProbabilityOfWin = +100%, IfNotCompleted: None
    Action: Attack Enemy Ancient

    Objective_3: Destroy Enemy Top Melee Barracks
    ...
    Effect: IfCompleted: ProbabilityOfWin = +20%, IfNotCompleted: None
    Action: Attack Enemy Top Melee Barracks

    Objective_4: Destroy Enemy Top Ranged Barracks
    ...
    Effect: IfCompleted: ProbabilityOfWin = +10%, IfNotCompleted: None
    Action: Attack Enemy Top Ranged Barracks

    Objective_5: Destroy Enemy Top T3 Tower
    ...

    ...

    Objective_N: Defend Friendly Ancient
    ...
    Effect: IfCompleted: ProbabilityOfWin = None, IfNotCompleted: ProbabilityOfWin = -60%
    Action: Clear Enemy Creep AND Clear Enemy Heroes

    Etc...

    We could start very high-level and abstract in our plan composition, and eventually we would have logical ANDs and ORs (for example: to kill a Tier 4 tower you have to have one of the T3 towers dead, but not all; however, having more destroyed increases the chance of achieving the objective). The "AI" part would come in deciding the appropriate traversal of the plan composition tree as the game progresses; a sketch of what such a structure might look like is below. It could monitor the total net worth and lane positioning and determine which objective is the best one to complete next. How to execute the actions would be hardcoded via algorithms (which can be flexible internally and not rigid) but not something we "learn".
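
    Something like this hypothetical structure (all names, numbers, and the greedy traversal are made up purely for illustration, not taken from any existing bot code):

    Code:
    from dataclasses import dataclass
    from typing import Callable, List, Optional

    @dataclass
    class Objective:
        name: str
        requirements: List[str]                       # objectives that must already be completed (AND)
        win_prob_delta: float                         # estimated effect on probability of winning
        action: Optional[Callable[[], None]] = None   # hardcoded execution routine

        def achievable(self, completed: set) -> bool:
            return all(r in completed for r in self.requirements)

    def next_objective(plan: List[Objective], completed: set) -> Objective:
        """Greedy traversal: pick the achievable, not-yet-completed objective with the largest payoff."""
        candidates = [o for o in plan if o.name not in completed and o.achievable(completed)]
        return max(candidates, key=lambda o: o.win_prob_delta)

    A learned component could later replace the fixed win_prob_delta values or the greedy selection with something estimated from the game state.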

    Just my thoughts..

  5. #5
    I just posted a response which partly answers your questions in the OpenAI thread, but I'll put a few more details here.

    Quote Originally Posted by nostrademous View Post
    Why have the bot "learn" how to creep block when it is much easier to just code it? I would think "learning" the "when to creep block" is much more important than "how to creep block".
    Creep Blocking is just an initial proof of concept problem. I think it would be rather over-optimistic to dive right into the deep end and implement a 1v1 bot straight off. By limiting myself to a simple problem first, I can more easily diagnose problems and hopefully gain insight into what may be feasible for the bigger problem.

    As to the rest of your post:
    1. ML research doesn't aim for hardcoded algorithms, because the goal is to develop machine learning algorithms which can mimic how people think. These algorithms will be broadly applicable to all sorts of areas outside of Dota 2. Also, we want to find the optimal policy/strategy, which cannot be reasoned out by humans for a problem of this size; i.e. we want ML to produce a perfect Dota player.

    2. We want to define the objective as loosely as possible. Ideally we just want to tell the algorithm that killing the throne is the goal, and it needs to figure out the rest. By putting other stuff in the objective, we are more likely to steer it away from the optimal policy (e.g. if we put too much weight on killing a tower, it may learn to suicide constantly against the tower until it is destroyed). Also, it is impossible for us to say what importance should be given to each sub-objective.
    Note that despite what I said in point 2, algorithms do commonly use sub-objectives, because it is hard for an algorithm to work out which actions contributed towards the main objective when the sequence of actions is too long (the credit assignment problem).
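
    As a purely illustrative example of that trade-off (the numbers are made up, not from my bot), this is how a badly weighted sub-objective produces the suicide-into-the-tower behaviour:

    Code:
    # Hypothetical shaped reward: a sparse "real" objective plus sub-objective terms.
    TOWER_REWARD = 10.0     # shaping bonus for destroying a tower (too large here)
    DEATH_PENALTY = -1.0    # penalty for dying (too small relative to the tower bonus)
    THRONE_REWARD = 100.0   # the sparse true goal

    def step_reward(towers_destroyed, deaths, throne_destroyed):
        return (TOWER_REWARD * towers_destroyed
                + DEATH_PENALTY * deaths
                + THRONE_REWARD * throne_destroyed)

    # With these weights, trading several deaths for one tower is still "profitable",
    # so the learned policy can happily dive the tower over and over.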

  6. #6
    Quote Originally Posted by BeyondGodlikeBot View Post
    Creep Blocking is just an initial proof of concept problem. I think it would be rather over-optimistic to dive right into the deep end and implement a 1v1 bot straight off. By limiting myself to a simple problem first, I can more easily diagnose problems and hopefully gain insight into what may be feasible for the bigger problem.
    So your approach is bottom-up. I simply suggest that perhaps a more rewarding approach might be top-down. I could be wrong, but just thinking out loud.

    Quote Originally Posted by BeyondGodlikeBot View Post
    1. ML research doesn't aim for hardcoded algorithms, because the goal is to develop machine learning algorithms which can mimic how people think. These algorithms will be broadly applicable to all sorts of areas outside of Dota 2. Also, we want to find the optimal policy/strategy, which cannot be reasoned out by humans for a problem of this size; i.e. we want ML to produce a perfect Dota player.
    I am not suggesting we hardcode them forever, just for now while we are learning the "top" strategy layers. Eventually we can improve the lower layers through ML.

    Quote Originally Posted by BeyondGodlikeBot View Post
    2. ... Also, it is impossible for us to say what is the appropriate importance needed to be given to each sub-objective. ...
    That's what I was suggesting we use ML for... determining what the appropriate transition values are when we are in some specific evaluated state with several possible actions.

  7. #7
    I looked at your code and I would suggest you make one tiny change.

    On line #99 of: game/dota_addons/creepblockai/scripts/vscripts/addon_game_mode.lua
    Code:
    		hero:MoveToPosition(position + 50*directions[command])
    Change the 50 to "hero:GetBaseMoveSpeed()/5.0" (I divide by 5.0 because you issue an action every 0.2 s, i.e. 5 per second). Obviously if you ever pick up movement-modifying items in your scenario you would have to account for them as well, but for now base movement speed will work. Not sure why you used 50, as Nevermore has 315 base movement speed (so 63 units in 0.2 s).

  8. #8
    Good start to the bottom-up approach. I just finished setting this up; a couple of questions:
    SendToServerConsole( "dota_creeps_no_spawning 1" )
    SendToServerConsole( "dota_dev forcegamestart" )

    These seem to get ignored/cheat-protected. Any way around that?

  9. #9
    Quote Originally Posted by nostrademous View Post
    I am not suggesting we hardcode them forever, just for now while we are learning the "top" strategy layers. Eventually we can improve the lower layers through ML.

    That's what I was suggesting we use ML for... determining what the appropriate transition values are when we are in some specific evaluated state with several possible actions.
    I see. Sure, that is a perfectly OK way to approach it. I think it may provide either a good estimate of the ideal transitions or a wildly inaccurate one, depending on how close to optimal the bot's hardcoded side is.

    Quote Originally Posted by nostrademous View Post
    Change the 50 to "hero:GetBaseMoveSpeed()/5.0" (I divide by 5.0 because you issue an action every 0.2 s, i.e. 5 per second). Obviously if you ever pick up movement-modifying items in your scenario you would have to account for them as well, but for now base movement speed will work. Not sure why you used 50, as Nevermore has 315 base movement speed (so 63 units in 0.2 s).
    Thanks for pointing this out, it's highly appreciated (there are likely a lot more small things that can be fixed). I'm still making a lot of changes to the code as I set up my ML training. It's a bit of a trial-and-error learning process.


    @EpiphanyMania try adding another console command "sv_cheats 1"
    Last edited by BeyondGodlikeBot; 08-17-2017 at 12:26 PM.

  10. #10
    Since you are new here, let me tell you that you are the 3rd person I know of who is trying to do this, and the previous 2 attempts did not have good results. (And I am with Nos on this one.)

    Why won't learning simple mechanics (even if it doesn't fail) be useful?

    1. It needs a lot of computational power. Even if you manage to write code that performs as well as OpenAI's, they are using an amount of computational power that the vast majority of people who want to use the bots don't have.

    2. The end result will be barely better (if not worse) than hardcoded scripts. All simple tasks can be done near-optimally with some Dota knowledge and scripts. OpenAI's SF gets 97 creeps in 10 minutes; mine gets 98 (as well as 48 denies). And my code works for any hero, while theirs would have to be trained again for each hero. Note that they used a GPU cloud while I used a laptop that is falling apart! (Also, I have dumbed down my code to make it run faster!) This is like killing a fly with a nuclear bomb!

    Why should I use machine learning then?

    1. Machine learning is a tool, nothing more. One case where it is useful is when the optimal policy is not clear. I can tell you exactly what you should do to last-hit/block creeps optimally, but does anyone know the optimal strategy for playing 5v5 Dota? Does it even exist? It is most likely like Rock Paper Scissors, in which a pure-strategy Nash equilibrium doesn't exist.

    2. It is extremely hard (if possible at all) to write a script that beats good (5-6k+ MMR) players at the strategy level.

    For these reasons I think the best approach is to rely on ML for choosing the strategies and hardcode the rest. BTW, even OpenAI admitted the current ML techniques need to be improved for the 5v5 game. Heck, they had to hardcode a bunch of things for 1v1 SF vs SF with a GPU cloud and a $1B budget, so how optimistic are you to think your code can learn everything?!

    Finally, figuring out a way to use ML techniques as a tool alongside hardcoded scripts is no less (possibly even more) valuable than relying on learning for everything. If you manage to find a way to make a team that already knows how to do a bunch of things (the hardcoded parts) learn a really good strategy, I can think of a lot of useful non-Dota-related applications for it!
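
    To illustrate the split I am arguing for (everything here is hypothetical: the skill names, the features, and the linear selector are made up), the only learned component would be a selector that picks which scripted behaviour to run:

    Code:
    import numpy as np

    HARDCODED_SKILLS = ["push_top", "defend_ancient", "farm_jungle", "group_and_smoke"]

    def hardcoded_execute(skill_name):
        """Placeholder for the scripted, hand-written execution of one skill."""
        print("executing scripted behaviour:", skill_name)

    class StrategySelector:
        """A tiny linear policy over game-state features; the only part that is learned."""
        def __init__(self, n_features, n_skills):
            self.weights = np.zeros((n_features, n_skills))   # would be trained with RL in practice

        def choose(self, features):
            return int(np.argmax(features @ self.weights))

    selector = StrategySelector(n_features=16, n_skills=len(HARDCODED_SKILLS))
    features = np.random.rand(16)   # e.g. net worth difference, lane states, timers
    hardcoded_execute(HARDCODED_SKILLS[selector.choose(features)])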
