
Thread: Let there be data!

  1. #11
    Basic Member
    Join Date
    Nov 2012
    Posts
    35
    Quote Originally Posted by Vimsey View Post
    If a column is always empty, define it as NULL; it saves even more space.
    Those columns (and there's a totals one as well, iirc) will be vanishing, as we've scrapped that stuff for now. We may also remove the bugged barracks status columns (one is a mirror of the other for all values, sadly) and total_item_cost (as we haven't filled it in yet!).

  2. #12
    Basic Member
    Join Date
    Sep 2012
    Posts
    23
    Quote Originally Posted by Sproinknet View Post
    There is certainly some redundant information to be removed and then some optimisation to be done; we'll do what we can to reduce the size of the final dumps. This sort of feedback is why we released the smaller dump first!
    I'm not talking only about the dump; don't you care about your DB size? It grows by 1GB+ daily, as you say.
    For each match there are 10 match_players rows. Each row has 26 int columns, of which at least 18 (player_slot, hero_id, kills, deaths, assists, leaver_status, gold, last_hits, denies, gold_per_min, xp_per_min, level, item_0_id, item_1_id, item_2_id, item_3_id, item_4_id, item_5_id) could be smallint. That saves 2 (bytes) x 18 (columns) x 10 (players) = 360 bytes per match, and for 100 million match records in the DB that's ~35GB of disk space!

    Use the most efficient (smallest) data types possible. MySQL has many specialized types that save disk space and memory. For example, use the smaller integer types if possible to get smaller tables. MEDIUMINT is often a better choice than INT because a MEDIUMINT column uses 25% less space.
    That's from the MySQL docs, but the same principle applies to PostgreSQL: http://dev.mysql.com/doc/refman/5.5/en/data-size.html
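
    For anyone checking the arithmetic, here's a quick back-of-envelope sketch in Python (the 4-byte int and 2-byte smallint sizes hold for both MySQL and PostgreSQL):

        int_bytes, smallint_bytes = 4, 2
        columns = 18           # the narrowable columns listed above
        players = 10           # match_players rows per match
        matches = 100 * 10**6

        saved = (int_bytes - smallint_bytes) * columns * players * matches
        print(saved / float(2**30))   # ~33.5 GiB, roughly the ~35GB quoted above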

    And about the dump: player_id in the match_players table looks like one of your own internal IDs, so it could be removed from the dump.

    Quote Originally Posted by Vimsey View Post
    If a column is always empty, define it as NULL; it saves even more space.
    I don't have much experience with Postgres, but with MySQL, NULL is sometimes not the best option from a performance point of view.

    Declare columns to be NOT NULL if possible. It makes SQL operations faster, by enabling better use of indexes and eliminating overhead for testing whether each value is NULL. You also save some storage space, one bit per column. If you really need NULL values in your tables, use them. Just avoid the default setting that allows NULL values in every column.
    http://dev.mysql.com/doc/refman/5.5/en/data-size.html
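
    A sketch of what both changes could look like on the Postgres side, run through Psycopg2 (the column list and connection string here are placeholders, not the project's actual setup):

        import psycopg2

        conn = psycopg2.connect("dbname=dota2 user=postgres")  # placeholder DSN
        cur = conn.cursor()
        # Narrow a few of the int columns and forbid NULLs in one pass;
        # identifiers can't be passed as query parameters, hence the format().
        for col in ("kills", "deaths", "assists", "gold_per_min", "xp_per_min"):
            cur.execute("ALTER TABLE match_players "
                        "ALTER COLUMN {0} TYPE smallint, "
                        "ALTER COLUMN {0} SET NOT NULL".format(col))
        conn.commit()
        conn.close()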
    Last edited by JohnnyAjax; 12-15-2012 at 11:12 AM.

  3. #13
    Basic Member VeRychard's Avatar
    Join Date
    Feb 2012
    Posts
    292
    I guess you could just query all the matches one by one for every player; that's the only logical way I can see (it's what Dotabuff does), and that's why it takes so long and consumes so much space (tens of gigs per day).


  4. #14
    Basic Member
    Join Date
    Nov 2012
    Posts
    35
    Quote Originally Posted by VeRychard View Post
    I guess you could just query all the matches one by one for every player; that's the only logical way I can see (it's what Dotabuff does), and that's why it takes so long and consumes so much space (tens of gigs per day).
    You seem to be missing the point: we already have the data (gathered via a method that doesn't rely on Dotabuff being up and not banning your IP!). The number of new games played each day is why the data grows by several hundred meg a day.

    Thanks to JohnnyAjax for poking us into not being lazy; we'll work on neatening up the columns over the next few days, though it may take a while to do the conversion and then export the entire database.

    Edit: It looks like it's going to be a couple of days of solid database hammering to alter the columns even if we don't run into problems and are completely organised about things, so the next waiting period is not going to be entirely our fault!
    Last edited by Sproinknet; 12-16-2012 at 10:04 AM.

  5. #15
    Basic Member
    Join Date
    Feb 2012
    Posts
    57
    Thanks for providing this. I have a general tech question for anyone hanging around who might have some related experience.

    Is there an easily available method of manipulating the database through Python to create a class structure for the match information? I've been able to look at the database through pgAdmin III, but I can't see a way I'd be able to use it to manipulate the data to the extent I've played around with my personal collection of API matches.

    My best guess would be using something like Psycopg2, but I've been unable to connect using the information from pgAdmin. Do I need to set up a more formal database than what I'm doing through pgAdmin to access it through other means? Is there an easy way of scraping the table information directly from the file without having to set up any kind of database? I don't have any experience trying to operate a database, so I could be going at things in a terribly backwards manner.

  6. #16
    Basic Member
    Join Date
    Nov 2012
    Posts
    35
    Quote Originally Posted by Phantasmal View Post
    Thanks for providing this. I have a general tech question for anyone hanging around who might have some related experience.

    Is there an easily available method of manipulating the database through Python to create a class structure for the match information? I've been able to look at the database through pgAdmin III, but I can't see a way I'd be able to use it to manipulate the data to the extent I've played around with my personal collection of API matches.

    My best guess would be using something like Psycopg2, but I've been unable to connect using the information from pgAdmin. Do I need to set up a more formal database than what I'm doing through pgAdmin to access it through other means? Is there an easy way of scraping the table information directly from the file without having to set up any kind of database? I don't have any experience trying to operate a database, so I could be going at things in a terribly backwards manner.
    • pgAdmin lets you run SQL code - more useful for getting data out than doing anything with it, but useful nonetheless!
    • I've not used Psycopg2, but you should be able to connect to the database with it. If pgAdmin is working, you do have a database, so it must be a permission/password/firewall/etc. issue; see if there's a log file somewhere that can help you troubleshoot. There's a minimal connection sketch below.
    • If you're intending to do something web-based in Python, look into Django; the ORM is pretty easy to use and it's a good system for building an entire website in. Personally I'd make sure you're on Python 2.7, though I believe the latest Django version is Python 3 compatible.
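
    The minimal Psycopg2 pattern looks like this (host, database, user, and password are placeholders for whatever pgAdmin is using, and the matches table name is a guess):

        import psycopg2

        # The same server details pgAdmin connects with should work here.
        conn = psycopg2.connect("host=localhost port=5432 dbname=dota2 "
                                "user=postgres password=secret")
        cur = conn.cursor()
        cur.execute("SELECT match_id FROM matches LIMIT 5")
        for row in cur.fetchall():
            print(row)
        conn.close()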


    Hope that helps!

  7. #17
    Basic Member
    Join Date
    Feb 2012
    Posts
    57
    It seems to have been a permission/password issue. We went through the SQL shell instead of trying to work through pgAdmin and managed to get things to the point where the script will connect. It was definitely helpful to know that the database was actually working, because it narrowed down the scope of possible problems.

    I don't intend to do anything web-based. I just do a bunch of measurements and calculations, and it's easiest for me to do them if I can convert your match data to the object class I've been using with the API. With the connection issues solved, it's hopefully just a matter of reading and transferring the data.
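
    That reading-and-transferring step can be as simple as this sketch (the class and column names here are made up, since the actual API class isn't shown in the thread):

        import psycopg2

        # Stand-in for the API-based class mentioned above.
        class MatchPlayer(object):
            def __init__(self, hero_id, kills, deaths, assists):
                self.hero_id = hero_id
                self.kills = kills
                self.deaths = deaths
                self.assists = assists

        conn = psycopg2.connect("dbname=dota2 user=postgres")  # placeholder DSN
        cur = conn.cursor()
        cur.execute("SELECT hero_id, kills, deaths, assists FROM match_players")
        players = [MatchPlayer(*row) for row in cur.fetchall()]
        conn.close()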

  8. #18
    Basic Member
    Join Date
    Dec 2011
    Posts
    83
    Any thoughts on getting this into MySQL? I don't have anything special planned, but if there's an easy way I'd be happy to play with it and see where it leads.

    I downloaded the backup, imported it into my local Postgres DB, made an export with the --inserts parameter as described at http://www.lightbox.ca/pg2mysql.php, and then ran it through pg2mysql.
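
    The export step boils down to something like this (a sketch; the database name is a placeholder):

        import subprocess

        # pg_dump with --inserts writes INSERT statements instead of COPY,
        # which is the format pg2mysql takes as input.
        subprocess.check_call(["pg_dump", "--inserts", "-f", "dump.sql", "dota2"])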

    But it's a big pain, and in the end it didn't work completely: at some point during the import into MySQL it encountered an unexpected \z.

    I did manage to get all 2,456,150 accounts, but I hit the error somewhere during match_players, I believe.

    Not that that's important right now; I have no way of automating it like this anyway.

    Any suggestions from the OP or others are welcome; I don't think I'll be the only one wanting this in MySQL.

  9. #19
    Basic Member
    Join Date
    Nov 2012
    Posts
    35
    I don't have much experience with Postgres -> MySQL conversions, but we were previously storing this database in SQLite and found http://kettle.pentaho.com/ to be a really good tool when we moved from SQLite to Postgres.

    One of the reasons we didn't go with MySQL is that it has some Unicode gotchas, which is a big issue for Dota 2, where people just love to put all sorts of weird symbols in their names. The player names for each game all go in match_players, so if I had to guess at the most likely cause of your issue, I'd pick that. I'm not sure what approach to suggest for testing and fixing it, though. If you have access to the table creation SQL that runs as part of your migration, you could check that it uses appropriate UTF types for the text columns.
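
    As a rough illustration of the kind of check I mean (this is a guess at the failure mode, not a confirmed diagnosis): plain utf8 in MySQL only covers characters up to 3 bytes, so text columns may need utf8mb4 to hold the full range of symbols players use. Something like this, using the MySQLdb driver with placeholder connection details, would list each column's collation:

        import MySQLdb

        conn = MySQLdb.connect(db="dota2_mysql", charset="utf8")  # placeholders
        cur = conn.cursor()
        # Any text column without a utf8/utf8mb4 collation is a candidate
        # for the import choking on player names.
        cur.execute("SHOW FULL COLUMNS FROM match_players")
        for row in cur.fetchall():
            print("%s %s" % (row[0], row[2]))  # Field and Collation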

  10. #20
    Basic Member
    Join Date
    Dec 2011
    Posts
    83
    It seems there are players with a backslash in their name (for example, '\\\' appears as a name value in one of the INSERT entries), and that gives problems. I can't fix it easily in the converter, as it doesn't look at the values.
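
    One naive workaround to try (an untested sketch): MySQL treats backslash as an escape character inside string literals, while the Postgres dump leaves it literal, so doubling every backslash in the dump before importing may get names like '\\\' through. This assumes backslashes only occur inside quoted values, which should hold for an --inserts dump but is worth checking:

        # Double every backslash so MySQL reads them as literal characters.
        with open("dump.sql") as src, open("dump_mysql.sql", "w") as dst:
            for line in src:
                dst.write(line.replace("\\", "\\\\"))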

    Well, thanks anyway. I guess I'll check in on this later and see how it develops.
