
Thread: open-ai's 1v1 mid bot code open source?

  1. #21
    They don't need to have access to creep target data. It is quite simple to infer in a reinforcement learning context:

    Let's say the bot is given a reward for reducing the enemy hero's hp, and has reward deducted for losing its own hp.
    Now we throw this bot out there into the wild, and tell it to find a sequence of actions that will produce the greatest overall reward.

    Understandably, the bot will learn that attacking the enemy hero, then attacking its own creep, and repeating this pattern will consistently produce the highest rewards.

  2. #22
    Quote Originally Posted by BeyondGodlikeBot View Post
    They don't need to have access to creep target data. It is quite simple to infer in a reinforcement learning context:

    Let's say the bot is given a reward for reducing the enemy hero's hp, and has reward deducted for losing its own hp.
    Now we throw this bot out there into the wild, and tell it to find a sequence of actions that will produce the greatest overall reward.

    Understandably, the bot will learn that attacking the enemy hero, then attacking its own creep, and repeating this pattern will consistently produce the highest rewards.
    Alright, here we go:

    AFAIK, any learning model (applicable to this problem) includes 3 elements:

    - State: a time frame (which might be defined not just by the clock, but by events) and a set of variables that the algorithm should consider, like your hp, enemy hp, creep hp, distances, etc.

    - A set of actions defined for each state: each action can be a set of small commands or a single command. You can hardcode some sequences of small commands (like go there, then attack), you can just have a singleton (go to x, or attack y), or you can use a sequence of completely random commands (for instance, any sequence of 3 small commands).

    - A reward/scoring/objective function: given a state it tells the algorithm how good that state is

    What learning algorithms do is look at your current state, choose an action (this choice is what they learn; let's call it the algorithm's behavior), and look at the reward of the resulting state (or how much the score changed, i.e. the derivative).
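
    To make that loop concrete, here is a toy Python sketch (made-up state variables, actions and numbers, nothing to do with OpenAI's actual code) of the three elements and of how the behavior gets nudged by the change in score:

        import random

        ACTIONS = ["attack_hero", "attack_creep", "move_back", "do_nothing"]

        def score(state):
            # reward/objective function: damage dealt to the enemy hero minus damage taken
            return state["enemy_hp_lost"] - state["own_hp_lost"]

        def step(state, action):
            # stand-in for the game: what the state looks like a moment later (toy numbers)
            new_state = dict(state)
            if action == "attack_hero":
                new_state["enemy_hp_lost"] += 20
                new_state["own_hp_lost"] += 10   # creeps turn on you
            elif action == "attack_creep":
                new_state["own_hp_lost"] += 2
            return new_state

        values = {a: 0.0 for a in ACTIONS}        # the learned "behavior"
        state = {"enemy_hp_lost": 0, "own_hp_lost": 0}
        for _ in range(1000):
            # mostly pick the best-looking action, sometimes a random one
            a = random.choice(ACTIONS) if random.random() < 0.1 else max(values, key=values.get)
            new_state = step(state, a)
            reward = score(new_state) - score(state)   # "how much the score changed"
            values[a] += 0.1 * (reward - values[a])    # nudge the behavior towards better actions
            state = new_state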

    Now, how can you tell if a claim is BS? Here are some ways:

    1. If the behavior has to use something that isn't defined in each state, but the claim says it doesn't. Example: if you can't see the creep health, you cannot possibly learn to last hit decently. If you don't have maphack, you cannot always follow the enemy hero correctly. If you can't see whether the creeps want to attack you or not, you cannot learn to back off when they are trying to attack you (creep aggro). However, you might have enough information that indirectly tells you the creeps will attack you soon (I will come back to this later).

    2. If they claim that the algorithm learns a behavior that doesn't increase the reward/score of the current state. Example: if your bot blocks, but your state is defined only over a 5-second window and the fact that "blocking is good" is not part of your objective, it cannot possibly have learned to block right after the creeps spawn.

    3. If the action/state space is "too big" but the claim says the algorithm learns/runs fast. Example: if in each state an action can consist of 20 small commands that have been chosen randomly (i.e. they are not handpicked/scripted) and you have 20 types of commands, then the action space consists of 20^20 (around 10^26) actions. So unless these actions and/or their outcomes have insanely nice properties (like a lot of these command sequences having the same outcome, a decent fraction of them being optimal, or the scoring function and/or its derivative being concave/continuous/monotone), learning something close to the optimum will not be possible in feasible time, even with a supercomputer.
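
    Just as a sanity check on that 10^26 figure (plain arithmetic, independent of any particular setup):

        >>> 20 ** 20        # 20 command types per slot, 20 command slots per action
        104857600000000000000000000
        >>> float(20 ** 20)
        1.048576e+26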

    Now let's look at the claims made by OpenAI:

    - The actions are chosen randomly: from point 3 and the number of commands per second the bot issues, we get that the state cannot span more than a few seconds.

    - Algorithm learned how to block: from point 2 and the previous line, we get that this is BS. Either they have hardcoded it into their scoring function to delay the creeps from reaching the center of the lane (by having something like abs(GetLaneFrontAmount(otherteam,LANE_MID) - (1 - GetLaneFrontAmount(team,LANE_MID)+0.02)) in the scoring function for the first X seconds; see the sketch after this list), or the blocking behavior itself has been scripted and merely tuned by learning.

    - Algorithm learned to aggro creeps: aggroing creeps means deliberately attacking (or pretending to attack) the enemy hero to reposition the lane creeps so you can LH/deny better in the future. This can be confused with attacking the hero and backing off when the creeps attack you, but they are completely different concepts. By definition, if the scoring function isn't hardcoded to consider creep positioning, or the state is not big enough to capture the concept of "future", learning creep aggro is meaningless. On top of that, since you need to know whether the creeps are attacking you, you again have to have a lot of variables (creep positions, their movement pattern, animation, their facing, etc.) in your state (this doesn't mean they can't learn it or that these are too many variables, but it just makes me think that they probably used a custom API, which may or may not be true and is not that important tbh).

    - Last hitting: easy to learn, if you include all the relevant parameters in the state (which aren't that many).

    - Algorithm learned to "fake" razes: the wording is misleading: the algorithm learned if the enemy went out of the spell radius it should stop the animation since its score goes down (has less mana and one less spell off cooldown). "faking" means you should cancel your anmiation even if the enemy doesn't move, and learning it requires the state to contain what happens in future (which will make it too big).

    - Algorithm learned to "dodge" razes: similar to the previous one. Both of these are simple concepts that can be learned by most learning algorithms (given enough information).

    - Algorithm learned to "zone" enemies: not hard to learn, given that in SF vs SF it is extremely easy to figure out you have an advantage in lane.

    - We didn't hard-code any strategy: hard-coding a strategy in the objective/scoring function is still hard-coding a strategy! The degree to which the algorithm learned things is highly exaggerated.

    - Our mission is to make these available to everyone: where is your code then?

    - This is an important step towards learning complicated tasks like being a surgeon: B**ch pls!
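
    For the blocking claim above, here is roughly what "hard-coding the strategy in the objective" could look like. This is only my guess at a possible shaping term, written in Python around a stand-in lane_front(team) that plays the role of GetLaneFrontAmount from the bot API; it is not anything OpenAI has published:

        def blocking_shaping_term(time_s, lane_front, team, other_team, x_seconds=20):
            # only shape the objective during the first X seconds of the game
            if time_s > x_seconds:
                return 0.0
            # penalize the gap between where the enemy wave is and where we want our
            # (slightly delayed) wave to meet it, as in the expression quoted above
            return -abs(lane_front(other_team) - (1 - lane_front(team) + 0.02))

    Once a term like this sits in the scoring function, "the algorithm learned to block" really means "the algorithm optimized a behavior we asked for explicitly".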

    Final words: these are my (and some of my friends') thoughts. There are three parts to this story. One: it is definitely cool to see that someone/a team managed to get simple mechanics learned almost to perfection. Two: the results were extremely exaggerated (especially given their budget and resources), and the show was trying to push an agenda that can be really harmful to the AI/learning community in the near future (read this and this) (and they want the money, of course). Three: Valve sold us out; even ignoring this, if you take a look at the titles of the articles that were published about this (and the tweet), you'll realize how this is being advertised as "OpenAI crushed/destroyed Dota2/esports players". Given their incentives/agenda, this was completely predictable. The last one bothers me the most.

    This is my last comment on this topic BTW, sorry if it became too long and/or some part of it is not clear enough!

  3. #23
    Basic Member aveyo (EU West, joined Aug 2012, 2,927 posts)
    Quote Originally Posted by Platinum_dota2 View Post
    Alright, here we go:
    On the topic of B**ch pls! - Counter-Strike's PODBot was pwning pros as early as Y2K. Now look at soviet script kiddies doing auto-stacking and skill-shot evading (among other things, like hexing you from fog) in 5kb js scripts..
    I'm fairly certain open-ai is still remarkable and has great potential, even if the authors exaggerated some of the training it has got.
    I mean - come on! Be reasonable. Only life can evolve on its own. Even your brain got years of training by imitation to be able to post in this thread. Cut open-ai some slack.

    And I forgot about your "Valve sold us out" - nope, open-ai will replace the currently underwhelming bots in HL3. Valve is playing the long game

  4. #24
    If I had known about this blog post, I wouldn't have written that giant response! It is from someone who knows what he is talking about (except that he doesn't know much about Dota and its mechanics) and is a good read (with less salt than what I wrote). I strongly suggest reading it.

  5. #25
    Basic Member aveyo (EU West, joined Aug 2012, 2,927 posts)
    OpenAI's new blog - so they admit they've done small amounts of coaching (actually, lots of it), but the bot's learning was self-play as stated.
    Pajkatt is a bug! SumaiL too. But we already know how talented these guys are.
    And it's confirmed they've used the bot API (and then some.. undisclosed things I've hinted at above), and not some lame red&blue pixel-scanner aimbot like some random redditard stated.

  6. #26
    https://blog.openai.com/more-on-dota-2/ confirms it was using the bot API

    quite nice to know
    really good, as it means that if they approach 5v5 using the API, it should end up adding lots of functionality we may not have realised we needed yet

  7. #27
    They could have just written this before/right after the event (and been more humble/honest during the show) and avoided all the criticism.

    To people at OpenAI: I do this for fun (since I mainly work on advanced algorithms and not ML), but I have a model that I believe can solve the 5v5 problem (I have also consulted a few of my friends who mainly do research on advanced ML). I'm still implementing it, but since I was a bit of an ass towards you, I am willing to share my ideas with you (if you promise to make the final code public and eventually release a version that can be played by anyone online). I would like to keep my identity private, so if you are interested, send me a private message on this forum.

  8. #28
    I tested SF last hitting with my current generic laning script (I did not make any adjustments for sf whatsoever). I thought it would be interesting for people to see the result since the recent blog states:

    "The scripted bot reaches 70 last hits in ten minutes on an empty lane, but still loses to reasonable humans. Our current best 1v1 bot reaches more like 97 (it destroys the tower before then, so we can only extrapolate), and the theoretical maximum is 101."

    Here are 2 screenshots from the very first trial:
    sf1.png sf3.png

    Edit: BTW, the SF does not use any spells (I just added the hero to test this and didn't write its spell usage).
    Last edited by Platinum_dota2; 08-17-2017 at 07:40 AM.

  9. #29
    Basic Member (joined Dec 2016, 733 posts)
    Quote Originally Posted by Platinum_dota2 View Post
    I tested SF last hitting with my current generic laning script (I did not make any adjustments for sf whatsoever). I thought it would be interesting for people to see the result since the recent blog states:

    "The scripted bot reaches 70 last hits in ten minutes on an empty lane, but still loses to reasonable humans. Our current best 1v1 bot reaches more like 97 (it destroys the tower before then, so we can only extrapolate), and the theoretical maximum is 101."

    Here are 2 screenshots from the very first trial:
    sf1.png sf3.png

    Edit: BTW, the SF does not use any spells (I just added the hero to test this and didn't write its spell usage).
    Very nice, you are making me want to go back to coding bots...

  10. #30
    I think I should provide some insight into what machine learning research (including OpenAI) is actually trying to do:

    Markov Decision Processes
    Dota 2, and pretty much anything humans do that takes place over time, can be modelled as what we call a Markov Decision Process. There are 3 key components: State, Action, and Policy.
    - State is everything we can observe at a given point in time (if we freeze the world, what can we see)
    - Action is what we decide to do at that point in time (note it is perfectly feasible to do nothing)
    - Policy can be understood to be the "strategy". Given a State, the Policy is how we decide on the Action to take. An optimal Policy is one where we end up with the most utility (this does not necessarily mean winning; it can also just mean the least bad outcome)
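
    In code terms, a Policy is just a function from State to Action. A minimal sketch with toy types (not tied to any particular framework):

        from typing import Callable, NamedTuple

        class State(NamedTuple):
            own_hp: int
            enemy_hp: int
            nearest_creep_hp: int

        Action = str                        # e.g. "attack_creep", "retreat", "wait"
        Policy = Callable[[State], Action]  # the "strategy": State -> Action

        def simple_policy(s: State) -> Action:
            # a hand-written Policy: last hit when the creep is low, back off when hurt
            if s.own_hp < 200:
                return "retreat"
            if s.nearest_creep_hp < 60:
                return "attack_creep"
            return "wait"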


    Any bot we are trying to build is basically implementing a Policy. For a simple game like Tic-Tac-Toe, it is perfectly feasible and far more practical to just use if/else statements to hardcode the Policy for every possible State. This can also be true for larger problems. However, the issue is that the State space is often astronomically large, which raises many issues such as:
    - We cannot know what the optimal Policy is, so the best we can do is guess
    - We cannot just break the problem down into small sub-problems and solve them individually to get an optimal Policy. This is because we are basically ignoring all the ways those sub-problems are coupled together
    - There is often so much information in every State that it is hard to even pick out what is important
    - and many more

    What machine learning research is trying to do is to build General Artificial Intelligence which captures the insane reasoning, imagination and generalization abilities of humans. If we sit down a person who has never touched Dota 2 in his life and tell him to play the game, he will very quickly work out a very noobish/sub-optimal but probably successful Policy. How the heck do we do that?? What goes on in our minds??

    Reinforcement Learning reckons we can mimic this partly by feeding Rewards which encourage certain Actions. That is what OpenAI did with its bot: plonk it into Dota 2, tell it the objective is to win according to the 1v1 rules, and then feed it rewards when it performs Actions which get it closer to this objective. We don't teach it the rules; it manages to infer them.
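
    As a toy illustration of that reward-feeding idea, here is a generic tabular Q-learning sketch (a textbook algorithm, not OpenAI's actual method, which they have not published) on a made-up 5-cell "lane" where the only reward is reaching the far end; nobody tells the agent the rules, it just gets rewarded and ends up with a Policy:

        import random

        N_STATES, ACTIONS = 5, (-1, +1)              # positions 0..4, move left or right
        Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

        def reward(s):
            return 1.0 if s == N_STATES - 1 else 0.0  # the objective lives only in the reward

        for _ in range(2000):
            s = 0
            while s != N_STATES - 1:
                # mostly take the best-looking action, sometimes explore
                a = random.choice(ACTIONS) if random.random() < 0.2 else max(ACTIONS, key=lambda x: Q[(s, x)])
                s2 = min(max(s + a, 0), N_STATES - 1)
                # feed the reward back: move Q(s, a) towards reward + discounted future value
                Q[(s, a)] += 0.1 * (reward(s2) + 0.9 * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
                s = s2

        # the learned Policy: take the highest-value Action in each State
        policy = {s: max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES)}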

    Even with OpenAI's success, it probably does not offer much in terms of new machine learning research. There is an inkling of hope that its success means that there is some new technique which allows us to tackle Continuous Markov Decision Processes (i.e. instead of being nicely limited to performing an action every x seconds, you can perform as many as you want whenever you want!). Also, it could possibly mean that we have some new technique to mimic how the human brain manages to very quickly focus on the important information in a given State and discard the rest.
