
Analysing collected data



We collect a load of data from users: currently about 20K distinct users, 50K hardware reports, 80K performance profiling reports and 2K text messages, about 1.6GB of uncompressed text in total. It's all stored in a MySQL database, with most of the data stored as JSON in a text column (so the data collection code doesn't have to know the schema of every message type the game might send). The hardware data looks like this; profiling data is like the F11 profiler mode. Queries are very slow, since they have to copy all the JSON text from the MySQL daemon to the application, decode it all, and do most of the processing manually in Python. There's a process that transforms the OpenGL data into a series of separate tables designed for fast queries from the web interface, but it'd be nice to support more ad-hoc queries without requiring that much effort.
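
To make the cost concrete, the current approach amounts to something like the sketch below. The rows, the field names (`os`, `ram_mb`) and the helper `count_by_field` are hypothetical stand-ins rather than the real schema or collection code; the point is only that every row's JSON text has to be decoded in Python before any filtering or counting can happen.

```python
import json
from collections import Counter

# Hypothetical stand-in for rows fetched from MySQL: (id, json_text) tuples.
# The field names are invented for illustration only.
rows = [
    (1, '{"os": "Linux", "ram_mb": 2048}'),
    (2, '{"os": "Windows", "ram_mb": 1024}'),
    (3, '{"os": "Linux", "ram_mb": 4096}'),
]

def count_by_field(rows, field):
    """Decode every JSON blob and tally the values of one field."""
    counts = Counter()
    for _row_id, text in rows:
        report = json.loads(text)  # full decode of each row, on every query
        counts[report.get(field)] += 1
    return counts

os_counts = count_by_field(rows, "os")
```

Every ad-hoc question repeats the same copy-and-decode pass over all the text, which is why this gets slow as the data grows.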

I've started playing a bit with MongoDB to see if it could work for this (actually mostly just because it's fun), and it seems alright. It's basically a system for storing lots of schemaless JSON documents, and then doing queries or map-reduce processing over them. It takes about five minutes to import the data from MySQL but then it can be accessed efficiently with no more JSON decoding.

That means it's possible to write some Python code like


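# `connection` is an open pymongo connection to the MongoDB server
connection.userreport.hwdetect.find(
    {
        "data_version": {"$gte": 2},
        # JavaScript predicate, evaluated on the server for each document
        "$where": "!(this.data['x86_caps[1]'] & (1<<26))"
    },
    # projection: only return these two fields
    {"data.cpu_identifier": 1, "data.gfx_card": 1}
)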

to find hwdetect reports (of version 2 or greater) without the CPU capability flag indicating SSE2 support (tested by a JavaScript function that runs on the server), returning the CPU identifier and graphics card, and it runs in about a second (with no indexes, and sufficient RAM to keep things cached). (Formatting the result gives this, showing at least a few people with acceptable graphics cards but no SSE2.)
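
For anyone who wants to check the bit test locally, here is a minimal Python equivalent of the `$where` predicate. The helper `has_sse2` and the two sample reports are made up for illustration; only the field name `x86_caps[1]` and the bit position 26 come from the query itself.

```python
SSE2_BIT = 26  # bit position tested by the $where clause

def has_sse2(caps1):
    """Return True if bit 26 of the x86_caps[1] value is set."""
    return bool(caps1 & (1 << SSE2_BIT))

# Hypothetical reports, shaped like the documents being queried.
reports = [
    {"data": {"x86_caps[1]": 1 << 26, "cpu_identifier": "cpu-with-sse2"}},
    {"data": {"x86_caps[1]": 1 << 25, "cpu_identifier": "cpu-without-sse2"}},
]

# Same filter as the query: keep reports where the SSE2 bit is clear.
no_sse2 = [r["data"]["cpu_identifier"] for r in reports
           if not has_sse2(r["data"]["x86_caps[1]"])]
```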

Some other code like


from bson.code import Code

# Emit one key per (capability set, user) pair
map_caps = Code("""
function() {
    var c0 = this.data["x86_caps[0]"];
    var c1 = this.data["x86_caps[1]"];
    var c2 = this.data["x86_caps[2]"];
    var c3 = this.data["x86_caps[3]"];
    emit({caps: [c0,c1,c2,c3], user: this.user_id_hash}, 1);
}
""")

# Sum the values emitted for each key
reduce_sum = Code("""
function(key, values) {
    var count = 0;
    values.forEach(function(val) {
        count += val;
    });
    return count;
}
""")

# Count each distinct (caps, user) pair once: in total, and per capability set
map_merge = Code("""
function() {
    emit("total", 1);
    emit(this._id.caps, 1);
}
""")

# `mconn` is an open pymongo connection, as in the earlier find() example
mconn.userreport.command("mapreduce", "hwdetect",
    query = {"data_version": {"$gte": 2}},
    map = map_caps, reduce = reduce_sum, out="temp_cpu_caps"
)

caps = mconn.userreport.temp_cpu_caps.inline_map_reduce(map_merge, reduce_sum)

does a couple of map-reduce jobs to count the number of distinct users with each set of CPU capabilities (plus the total number of distinct users), giving this data in about five seconds.
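
The two-stage logic is easier to follow in plain Python. This is only a sketch of the same counting idea, not what MongoDB runs: the sample reports and the names `distinct_pairs` and `users_per_caps` are invented for the example.

```python
from collections import Counter

# Hypothetical reports: (user_id_hash, caps tuple). Values are made up.
reports = [
    ("u1", (1, 2, 3, 4)),
    ("u1", (1, 2, 3, 4)),  # same user reporting twice
    ("u2", (1, 2, 3, 4)),
    ("u3", (9, 9, 9, 9)),
]

# Stage 1: collapse to distinct (caps, user) pairs, like map_caps + reduce_sum.
distinct_pairs = Counter((caps, user) for user, caps in reports)

# Stage 2: count each distinct pair once, like map_merge + reduce_sum.
users_per_caps = Counter()
for caps, _user in distinct_pairs:
    users_per_caps["total"] += 1
    users_per_caps[caps] += 1
```

One consequence of keying on the (caps, user) pair: a user who reported two different capability sets is counted once under each set and twice in the total.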

I can't run this on my web server: the database is too large for a 32-bit address space (MongoDB mmaps the entire thing), it likes lots of physical RAM (for caching), and its CPU cost is still too high for real-time usage. It does seem convenient and fast enough for offline analysis, though.

It'd probably be useful to publish some analyses of this data and update them over time (maybe run it over each 3-month period), like the Steam or Unity ones. I don't know if I'll have time to do this soon, but just in case: what kind of reports (tables or graphs based on the hardware/profiling data) would be interesting or useful? (If they're mostly variations on a few themes then it should be easy and quick to do lots of them, so hopefully this shouldn't take much effort.)



Pie or bar charts, either for all samples or averaged over fixed periods:

  • OS (Windows/Linux/OS X)
    • Windows version (2000, XP, ME, Vista, 7, ?)
    • Linux distro, and maybe kernel if it matters?
    • OS X release (10.5, 10.6, 10.7, etc.)
  • Graphics card maker (ATI/AMD/nVidia/Intel: possibly subchart for each?)
  • CPU vendor (AMD/Intel: possibly subchart for each vendor to show important distinctions, not just clockrate?)
  • System architecture (32/64 bit and/or 32-bit userspace on 64-bit)
  • RAM (under 512MB, 512MB-1GB, 1-2GB, 2-4GB, 4+ GB)
  • Supported OpenGL version
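
As a rough idea of how one of these charts could be fed, here is a sketch of the RAM bucketing. The function `ram_bucket` and the sample sizes are hypothetical; the bucket ranges are the ones suggested in the list, and treating each lower edge as inclusive is an assumption.

```python
from collections import Counter

def ram_bucket(ram_mb):
    """Map a RAM size in MB to one of the suggested chart buckets."""
    # Assumption: each lower edge is inclusive (512MB falls in "512MB-1GB").
    if ram_mb < 512:
        return "under 512MB"
    if ram_mb < 1024:
        return "512MB-1GB"
    if ram_mb < 2048:
        return "1-2GB"
    if ram_mb < 4096:
        return "2-4GB"
    return "4+ GB"

# Hypothetical RAM sizes in MB, standing in for values from hardware reports.
samples = [256, 512, 1024, 3072, 4096, 8192]
histogram = Counter(ram_bucket(mb) for mb in samples)
```

The resulting counts per bucket are what a pie or bar chart would plot.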

Graphs:

  • Release adoption (does anyone still play A3?)
  • Avg. framerate by release
  • Avg. framerate by graphics card (and maybe other criteria, like OS)
  • Some moving averages (maybe 1 month window) of above data to show adoption of new technologies
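
The moving-average idea could be sketched like this. The function `moving_average` and the sample series are made up; a trailing window that averages over fewer points at the start of the series is just one possible choice.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

# Hypothetical share of reports supporting some feature, one value per period.
share = [0.0, 0.5, 1.0, 1.0]
smoothed = moving_average(share, window=2)
```

With one value per day, a window of about 30 would approximate the 1 month window mentioned in the list.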


If we need to recruit someone to do this, I can help spread the word. Just let me know.

I just got an email from Ryan saying he's got a friend with many years of PHP experience who's asking if we have anything he might help with. I'll send him a link to this thread so he can forward it to his friend to see if he's got the skills needed (I know PHP isn't mentioned above, but I have a feeling it's not too unlikely he has other skills as well. Otherwise we might have some other suitable tasks. I don't remember the latest discussions, but perhaps his skills could be relevant for the matchmaking server, if we're going to do anything remotely like having a web server act as the main hub? Anyway, I'm glad there are people other than me who know that ;) Or if nothing else he might be of assistance once we implement the new web site, though the issue there seems to be more about getting a solid design together.).

(It's probably a sign of something when the parenthesis is more than double the size of the rest of the paragraph :P And probably not a good sign ;) )


  • 5 months later...

Hmm, I've still not done a lot with this, but I updated the CPU capability data here to give details of exactly which CPUs do/don't support each feature, in a hopefully not incredibly unreadable fashion.

(Does anyone happen to know what the "eax=8...1h ecx[...]" bits (i.e. bits 18/23/24 of ECX after running CPUID with EAX=80000001h) are meant to signify?)



ECX bit 24 (PCX_NB): NB perf counter extensions (MSRs C001_024[0...7]h)

ECX bit 23 (PCX_CORE): core perf counter extensions (MSRs C001_020[0...B]h)

Source: http://www.sandpile.org/x86/cpuid.htm
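
If it helps with labelling the report, decoding those two bits could look like the sketch below. The table `EXT_ECX_FLAGS` and the function `decode_ext_ecx` are hypothetical names; only the two bit meanings listed in this reply are included, and any other bit is left undecoded.

```python
# Bits of ECX after CPUID with EAX=80000001h, limited to the two named here.
EXT_ECX_FLAGS = {
    23: "PCX_CORE",  # core perf counter extensions
    24: "PCX_NB",    # NB perf counter extensions
}

def decode_ext_ecx(ecx):
    """Return the names of the known flags that are set in the ECX value."""
    return [name for bit, name in sorted(EXT_ECX_FLAGS.items())
            if ecx & (1 << bit)]

both = decode_ext_ecx((1 << 23) | (1 << 24))
```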
