Thread: Anyone tracking Skill by Match_ID?

  1. #11
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    After a few hours of the scraper running, I'm finding that I definitely can't keep up tracking all games, so I'm stuck only getting High/VHigh (leaving a likely false assumption that the ones I don't get are Medium).

    I feel pretty confident that I'm missing some High games (due to the issue described above), but I think I have all the Very High ones. What's really unfortunate about this, though, is that the games I'm missing are going to be far longer on average than the games I'm successfully scraping. This will cause all sorts of inaccuracies in my calculations.

    How did you account for the fact that your "500 game block" returned by GetMatchHistory is weighted towards shorter games and isn't static, Phantasmal?

    EDIT: Durr, I suppose you used historical data. Any scraping of past data will give you a static 500 game block that isn't weighted towards shorter games. I guess my new question is: Do you have any ideas how I can account for the fact that my 500 game blocks are weighted towards shorter games and aren't static?
    Last edited by Aardvarki; 03-13-2013 at 11:19 AM. Reason: I'm Dumb

  2. #12
    Basic Member
    Join Date
    Feb 2012
    Posts
    57
    Quote Originally Posted by Aardvarki View Post
    How did you account for the fact that your "500 game block" returned by GetMatchHistory is weighted towards shorter games and isn't static, Phantasmal?
    I don't. I haven't actually tested live recording, partly because I was afraid I would run into this issue, which would ruin its value for what I'm trying to do. Instead, I've been building samples by settling for the 500 most recent games of each particular day in all three brackets. I haven't noticed a weighting towards shorter games, possibly because it shouldn't matter how long a game goes on as long as it was one of the last 500 games created on that particular day.

    If High is incomplete, and incomplete in a way that specifically discounts what tend to be the most important matches, then getting complete skill data might be impossible, at least through this method. For some particular 24 hour period, do you have the number of games recorded through Sequence and the number of High and Very High games recorded through timed scraping?

    Edit: Yeah, I can't really see a way around that. For every bracket, there's going to be a match duration that's going to fall out of the 500 most recent matchIDs before it actually gets recorded. For Very High this duration might be long enough that the matches of that length are a negligible loss. If that's the case you might be best off settling for having Very High and not-Very High collections.
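
    A rough sketch of the cut-off described above, assuming matches in a bracket are created at a steady rate and the 500-match window is ordered by creation. The function name and the two creation rates are made up for illustration; they are not measured values for any bracket.

```python
WINDOW_SIZE = 500  # matches returned per bracket, as discussed in the thread


def cutoff_minutes(created_per_minute: float, poll_interval_seconds: float = 0.0) -> float:
    """Longest match duration (in minutes) that can still be recorded.

    A match occupies a slot in the window from the moment it is created.
    Once WINDOW_SIZE newer matches have been created it drops out, so it
    must finish and be polled before that happens. The poll interval is
    subtracted as the worst-case wait between finishing and being seen.
    """
    minutes_in_window = WINDOW_SIZE / created_per_minute
    return minutes_in_window - poll_interval_seconds / 60.0


# Hypothetical creation rates, only to show how the cut-off moves.
for label, rate in [("quiet bracket", 5.0), ("busy bracket", 20.0)]:
    limit = cutoff_minutes(rate, poll_interval_seconds=1.5)
    print(f"{label}: matches longer than about {limit:.1f} min would be lost")
```

    The busier the bracket, the shorter the cut-off, which is why the smallest bracket is the one most likely to be captured almost completely.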
    Last edited by Phantasmal; 03-13-2013 at 11:42 AM.

  3. #13
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    Quote Originally Posted by Phantasmal View Post
    If High is incomplete, and incomplete in a way that specifically discounts what tend to be the most important matches, then getting complete skill data might be impossible, at least through this method. For some particular 24 hour period, do you have the number of games recorded through Sequence and the number of High and Very High games recorded through timed scraping?
    I don't (presently). I just started scraping the skill ratings today, and as I'm only scraping current skill data and my historical data scrape is still catching up, odds are I won't be able to compare the two data sets for a couple weeks.

  4. #14
    Basic Member MuppetMaster42's Avatar
    Join Date
    Nov 2011
    Location
    Australia
    Posts
    585
    Quote Originally Posted by Aardvarki View Post
    Of course I am. How else can I ever catch up to live data? GetMatchHistoryBySequenceNum returns 100 matches (with full detail) from a single API call. I'm only hitting the API 35,000 times per day, but I'm pulling back 3,500,000 matches a day.
    My apologies - I thought you meant you were hitting the API with 100 actual calls per 2.6 sec.
    If you're just making a single API call/2.6 sec then that's a-okay.

  5. #15
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    Yeah, the phrasing was awkward, no harm. I'm a database developer during the day (with emphasis on DB optimization) so respecting the limits of software and hardware is pretty ingrained in my psyche. Besides, I'm sure Valve would've shut me down by now if I were going THAT overboard on API calls. Right now, however, I'm probably pushing the envelope - I'm making my stats scrape every 2.6 seconds and my skill scrape every 1.5 (averaging just a hair over 1 request per second). I think the real limit is 100,000 per day (not 1/second or 86,400 per day), so I think I'm still fine (and since I've threaded the jobs, they both run concurrently with no performance hit).
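
    For reference, the arithmetic behind those two intervals can be checked with a few lines. This is only a sketch: the 100,000 per day figure is the guess stated above, not a confirmed limit, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def requests_per_day(interval_seconds: float) -> float:
    """Requests made in a day by a job that fires once every interval_seconds."""
    return SECONDS_PER_DAY / interval_seconds


stats_calls = requests_per_day(2.6)  # stats scrape
skill_calls = requests_per_day(1.5)  # skill scrape
total_calls = stats_calls + skill_calls
combined_rate = total_calls / SECONDS_PER_DAY  # requests per second

ASSUMED_DAILY_LIMIT = 100_000  # the poster's guess, not a documented value

print(f"stats: {stats_calls:,.0f}/day, skill: {skill_calls:,.0f}/day")
print(f"total: {total_calls:,.0f}/day ({combined_rate:.2f} requests/second)")
print(f"under assumed limit: {total_calls < ASSUMED_DAILY_LIMIT}")
```

    The two jobs together come to roughly 90,800 requests a day, or a little under one second between requests on average, which is above 86,400 but below the assumed 100,000.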

    Also, my response to you may have come across far more gruff than I intended it to, so sorry about that! I'm actually a nice guy!

  6. #16
    Basic Member jimmydorry's Avatar
    Join Date
    Dec 2012
    Posts
    814
    Quote Originally Posted by Sproinknet View Post
    There's a huge dump here which might speed up getting up to date for you somewhat (should have all games until December 2012)
    Quote Originally Posted by Aardvarki View Post
    I actually downloaded the dump already, but my scripts had nearly made it as far as the dump got by the time I finished downloading it (it took me a solid week averaging under 60kb/s) - between that and standards differences (I use MySQL) I decided in the end not to use it. However, it's awesome that you put it out there, and I wish I could've made use of it.
    *SNIP
    This is exactly right. I still haven't set up a PostgreSQL instance to make use of the DB backup, which will probably take a few days to restore. When the PostgreSQL restore is completed, I will be converting to MySQL regardless.

    I may hold off on this endeavour and wait and hope for Aardvarki to potentially share a complete MySQL backup. =P A combination of lack of SSD space and lack of consecutive time blocks to throw at the conversion has prevented me from making use of the PostgreSQL dump.

  7. #17
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    Quote Originally Posted by jimmydorry View Post
    I may hold off on this endeavour and wait and hope for Aardvarki to potentially share a complete MySQL backup. =P A combination of lack of SSD space and lack of consecutive time blocks to throw at the conversion has prevented me from making use of the PostgreSQL dump.
    I'll host a torrent once I'm caught up. It'll be at least a week, however. I managed to rewrite my scripts to take advantage of threading, which improved my throughput by about 45%, but I've still got a solid 40 million matches to retrieve.

    Unless you'd prefer one sooner. I can always share an incomplete batch. By the time I'm revisiting this thread tomorrow I'll probably be nearing the end of 2012's matches.

  8. #18
    Basic Member jimmydorry's Avatar
    Join Date
    Dec 2012
    Posts
    814
    Awesome. There is no rush. The more data, the more I can look at at once.

  9. #19
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    Just so you don't think I've forgotten - I should be finished with the data retrieval (to current) by the end of the week. I'm not sure how long it will take to make a presentable dump - I plan on making a few database changes (adding a few indices where I didn't think they were needed before the database grew as large as it did).

    Right now (in un-optimized key form) the database is over 300GB. I'm sure a dump will be considerably smaller, but I'm not really sure by how much. I'm guessing the final dump will likely be in the ~150GB range. I can share it with or without matches that I can easily tell don't count, and with or without the skill build table (the largest table in the database - 4 billion rows so far) - taking both of those out should cut the size to under 100GB.

  10. #20
    Basic Member
    Join Date
    Mar 2013
    Posts
    28
    Quote Originally Posted by Phantasmal View Post
    If High is incomplete, and incomplete in a way that specifically discounts what tend to be the most important matches, then getting complete skill data might be impossible, at least through this method. For some particular 24 hour period, do you have the number of games recorded through Sequence and the number of High and Very High games recorded through timed scraping?
    I've been processing live skill data for a few weeks now and have found some interesting stuff. I know I'm getting the vast majority of very high skill games (some checking shows I'm marking less than 1% as unknown) and also the considerable majority of high skill games (less than 20% unknown). These two bits of data plus the full number of games recorded now over the same timeframe give me a pretty good guess as to the percentages of normal/high/very high games being played.

    Yesterday, for example, I recorded 20095 Very high skill games. Assuming I missed less than 1%, that means somewhere between 20095-20300 very high skill games were played.
    In the same timeframe, I recorded 41073 High skill games. Assuming I missed less than 20%, that means somewhere between 41073-51342 high skill games were played.
    I also recorded 280414 games that I listed as "unknown" - knowing this means I know 341582 games happened yesterday. With my ranges for H/VH, I know between 269940-280414 normal skill games were played.

    So I can say with a good amount of certainty, yesterday, between 79.0-82.1% of all games were Normal, 12.0%-15.0% were High, and 5.9%-6% were Very High.

    These ranges are based on my full range from zero error to my calculated error. Since the odds are that my calculated error is more accurate than assuming zero error, it's most likely that the actual split is roughly 79/15/6.
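
    The ranges above can be reproduced with a short script. This is a sketch of the arithmetic only: it reads "missed less than 1%" as "recorded at least 99% of the true count" (and likewise 20% / 80% for High), and the variable names are illustrative. It rounds exactly rather than to the nearest hundred, so the Very High upper bound and the Normal lower bound differ by a couple of games from the figures quoted above.

```python
import math

# Counts recorded over one day, taken from the post above.
very_high_seen = 20095
high_seen = 41073
unknown_seen = 280414

# Upper bounds on the fraction of each bracket that was missed.
VH_MISS = 0.01
HIGH_MISS = 0.20

total = very_high_seen + high_seen + unknown_seen  # 341,582

# If at most a fraction m was missed, the true count is at most seen / (1 - m).
vh_max = math.ceil(very_high_seen / (1 - VH_MISS))
high_max = math.ceil(high_seen / (1 - HIGH_MISS))

# Every missed High or Very High game is sitting in the "unknown" pile,
# so the fewest possible Normal games is unknown minus the most that were missed.
normal_max = unknown_seen
normal_min = unknown_seen - (vh_max - very_high_seen) - (high_max - high_seen)


def share(count: int) -> float:
    """Percentage of all games, rounded to one decimal place."""
    return round(100 * count / total, 1)


print(f"Very High: {very_high_seen}-{vh_max} ({share(very_high_seen)}%-{share(vh_max)}%)")
print(f"High:      {high_seen}-{high_max} ({share(high_seen)}%-{share(high_max)}%)")
print(f"Normal:    {normal_min}-{normal_max} ({share(normal_min)}%-{share(normal_max)}%)")
```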
