00:54:27: <Doxin> impomatic_: that's a bold move :P
01:07:54: -!- Doxin has quit (Remote host closed the connection).
01:27:05: -!- myndzl has joined #corewars.
01:28:29: -!- myndzi has quit (*.net *.split).
06:20:38: -!- Doctor1C has joined #corewars.
07:30:27: -!- catageek has joined #corewars.
11:50:03: -!- Qu4Z_ has joined #corewars.
11:57:07: -!- Qu4Z has quit (*.net *.split).
12:49:39: -!- Doxin has joined #corewars.
12:49:39: -!- Doxin has quit (Changing host).
12:49:39: -!- Doxin has joined #corewars.
13:01:38: -!- catageek|2 has joined #corewars.
13:06:55: -!- catageek has quit (Ping timeout: 244 seconds).
13:06:57: -!- Qu4Z_ has quit (Ping timeout: 244 seconds).
13:06:58: -!- Doctor1C has quit (Ping timeout: 244 seconds).
13:10:15: -!- Qu4Z has joined #corewars.
13:16:16: -!- Doctor1C has joined #corewars.
13:59:40: -!- Doctor1C has quit (Ping timeout: 250 seconds).
14:52:22: -!- impomatic_ has quit (Ping timeout: 240 seconds).
15:38:40: -!- catageek|2 is now known as catageek.
16:19:16: -!- impomatic_ has joined #corewars.
16:26:07: -!- Doctor1C has joined #corewars.
16:27:55: <impomatic_> I've updated and extended the Hall of Fame. It now recognises all warriors which reached 400+ :-) http://corewar.co.uk/halloffame.htm
17:21:00: <impomatic_> Does anyone fancy writing a script for me? :-D
17:22:53: <impomatic_> To take a 1.6 million line, 68 MB file and split it up into individual messages, remove the junk from the headers, then sort them by date (dates are in various formats).
17:23:08: <impomatic_> GW Basic can't cope ;-)
17:27:57: <Doxin> impomatic_: what exactly does it need to do?
17:28:04: <Doxin> impomatic_: I can hack some python script together for you
17:30:05: <impomatic_> Doxin: I've got a dump of every message ever posted to r.g.cw
17:30:12: <Doxin> right
17:30:20: <Doxin> impomatic_: so basically concatted email messages?
17:31:08: <Doxin> shouldn't be too hard a problem
17:32:03: <Doxin> impomatic_: what output format would you like?
17:32:27: <impomatic_> Yes, but they're not in date order. I wanted to get them in date order. The input is a plain text file and I'm after a plain text file.
17:33:30: <impomatic_> Also it'd be handy to strip everything from the header apart from the subject, message-id, date, from, references and line-count
17:33:31: <Doxin> right
17:33:55: <Doxin> if you send me some sample data, say the first few mb or whatnot, I can get cracking on that.
17:34:44: <impomatic_> Fantastic, thanks :-)
17:35:18: <impomatic_> I normally write something in JavaScript to handle this kind of thing, but it's far too big!
17:35:23: <Doxin> hmm
17:37:18: <impomatic_> It hangs most Windows text editors I've tried to load it in.
17:37:42: <Doxin> yeah, you really don't want to load big files like that into a text editor :P
17:38:57: <Doxin> impomatic_: you working on getting me a sample of the data so I can test stuff out?
17:42:50: <impomatic_> Got one, but I don't think I've got your email address
17:43:06: <Doxin> oh
17:43:22: <Doxin> impomatic_: I was expecting you to dropbox it or whatever. you can send it to lieuwemo@gmail.com
17:44:44: <impomatic_> Sent, thanks. If it ends up being trickier than expected then don't worry too much :-)
17:47:00: <Doxin> of course it's going to be tricky, it's email.
17:50:00: <Doxin> jesus what encoding is that in ._.
17:50:41: * Doxin uses chardet
17:54:17: <impomatic_> Would UTF-8 be better? I can resend
18:04:26: <Doxin> nah
18:04:28: <Doxin> impomatic_: it's fine
18:04:31: <Doxin> just weird
18:04:38: <impomatic_> :-)
18:04:49: <Doxin> impomatic_: I mean I just have to set the encoding on file-open to UTF-16LE for it to work, but still :P
18:22:40: <Doxin> impomatic_: shouldn't an email dump like this be binary? or are all the mails converted to utf16le?
18:26:27: <impomatic_> I think the original is in ANSI encoding, but Notepad++ decided to encode as utf16le when I copy / pasted the sample.
18:29:46: <Doxin> alright
18:30:11: <Doxin> changing encoding is pretty trivial either way
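A minimal sketch of the encoding change discussed above, assuming the sample really is UTF-16LE as Doxin found. The function name `convert_encoding` and the file paths are hypothetical. It streams line by line, so the whole file never has to fit in memory, and it drops a leading byte-order mark if one is present.

```python
import io

def convert_encoding(src_path, dst_path,
                     src_encoding="utf-16-le", dst_encoding="utf-8"):
    """Re-encode a text file line by line without loading it all at once."""
    # newline="" keeps the original line endings untouched on both sides.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src, \
         io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        first = True
        for line in src:
            if first:
                first = False
                # "utf-16-le" does not strip a BOM, so remove it by hand.
                if line.startswith(u"\ufeff"):
                    line = line[1:]
            dst.write(line)
```

Usage would be something like `convert_encoding("sample.txt", "sample-utf8.txt")`; if the real dump is in an ANSI code page instead, a different `src_encoding` (for example `"cp1252"`) would be the thing to try.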
18:35:03: -!- myndzl is now known as myndzi.
18:35:40: <Doxin> impomatic_: what headers would you like to keep?
18:36:01: <Doxin> [u'X-Google-Language', u'X-Google-Thread', u'X-Google-Attributes', u'X-Google-ArrivalTime', u'Path', u'From', u'Newsgroups', u'Subject', u'Keywords', u'Message-ID', u'Date', u'References', u'Sender', u'Organization', u'Lines', u'Nntp-Posting-Host'] is the list of headers most of em have it seems
18:36:52: <Doxin> impomatic_: also, are you sure you just want a plaintext file back? it seems sensible to split it into smaller parts
18:37:45: <impomatic_> Well I was planning to split into months
18:39:23: <Doxin> that should be easy enough to do at this stage
18:40:09: <impomatic_> I was only planning to keep these in the header [u'From', u'Newsgroups', u'Subject', u'Keywords', u'Message-ID', u'Date', u'References', u'Sender', u'Organization', u'Lines']
18:40:45: <Doxin> alright
18:42:41: <Doxin> impomatic_: mind if I grab the date from X-Google-ArrivalTime if no Date is specified?
18:43:26: <impomatic_> That should be fine, hopefully none will have the date missing.
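The header filtering and the date fallback agreed above could be sketched roughly like this with the standard `email` module. This is not the code from Doxin's script; the names `KEEP` and `strip_headers` are hypothetical, and the format of `X-Google-ArrivalTime` is not shown in the log, so the fallback just copies the value across as-is.

```python
import email

# Headers impomatic_ asked to keep.
KEEP = ["From", "Newsgroups", "Subject", "Keywords", "Message-ID",
        "Date", "References", "Sender", "Organization", "Lines"]

def strip_headers(raw_message):
    """Parse one message and drop every header not listed in KEEP."""
    msg = email.message_from_string(raw_message)
    # Fall back to the arrival time only when no Date header exists.
    if msg["Date"] is None and msg["X-Google-ArrivalTime"] is not None:
        msg["Date"] = msg["X-Google-ArrivalTime"]
    keep_lower = set(name.lower() for name in KEEP)
    for name in set(msg.keys()):
        if name.lower() not in keep_lower:
            del msg[name]  # removes every occurrence of that header
    return msg
```

Calling `strip_headers(text).as_string()` would give the trimmed message back as plain text.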
18:49:24: <impomatic_> Once it's split, I'll read through all 68 MB to write a brief monthly summary, pick out any warriors missing from Koenigstuhl (I've found quite a few already) and remove any spam.
18:49:57: <impomatic_> Shouldn't take long. It's only equivalent to about 30 odd novels!
18:53:58: <Doxin> hah
18:54:15: <Doxin> wouldn't detecting a message with a warrior in it be pretty easy to grep out though?
18:55:02: <impomatic_> Yes, should be. That's how I've found most of those I submitted to Koenigstuhl recently.
18:55:46: <Doxin> in the meantime, magicdate to the rescue!
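For the sort-by-date and split-into-months steps, here is a rough sketch that swaps in the standard library's `email.utils` for the date parsing. That is an assumption for illustration: it only understands RFC 2822 style dates, so the "various formats" in the real dump would need something more forgiving, such as the parsedatetime module mentioned later in the log. The function name `sort_into_months` and the `"undated"` bucket are hypothetical.

```python
import time
from collections import defaultdict
from email.utils import mktime_tz, parsedate_tz

def sort_into_months(messages):
    """Group (date_header, text) pairs by month, oldest first in each month."""
    buckets = defaultdict(list)
    for date_header, text in messages:
        parsed = parsedate_tz(date_header or "")
        if parsed is None:
            # Unparseable date: set aside for manual inspection.
            buckets["undated"].append((0, text))
            continue
        timestamp = mktime_tz(parsed)  # seconds since the epoch, UTC
        month = time.strftime("%Y-%m", time.gmtime(timestamp))
        buckets[month].append((timestamp, text))
    return {month: [text for _, text in sorted(items, key=lambda item: item[0])]
            for month, items in buckets.items()}
```

Each key of the result (for example `"1994-01"`) could then become one output file.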
18:58:18: <impomatic_> I just need to pick up any warriors that haven't got the proper header. It's impossible to automatically distinguish them from code examples, and the name / author / etc. is normally buried somewhere in the text.
18:59:16: <Doxin> right
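The easy case Doxin describes, grepping out messages that contain a warrior, might look like the sketch below. It assumes "the proper header" means a line starting with `;redcode`, the usual first line of a submitted warrior; the name `has_warrior` is hypothetical. Warriors without that line are exactly the ones it misses, as impomatic_ points out.

```python
import re

# A ";redcode" line (";redcode", ";redcode-94", ...) at the start of a line.
REDCODE_MARKER = re.compile(r"^\s*;redcode", re.IGNORECASE | re.MULTILINE)

def has_warrior(message_text):
    """Return True if the text contains a ;redcode header line."""
    return REDCODE_MARKER.search(message_text) is not None
```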
19:10:01: -!- Doxin has parted #corewars ("Leaving").
19:10:15: -!- Doxin has joined #corewars.
19:19:59: <Doxin> impomatic_: I got something working, we can either try to get your maildump to me, or we can try to get my program to run on your pc. what do you prefer?
19:20:56: <impomatic_> I can try running it on my PC.
19:21:03: <Doxin> alright
19:21:06: <impomatic_> Python is currently installing :-)
19:21:36: <Doxin> impomatic_: you'll need python installed, and you want to install this module https://pypi.python.org/pypi/parsedatetime/1.4
19:22:44: <Doxin> impomatic_: https://gist.github.com/SuperDoxin/8fa4a2ccd10a0c34fca1 it'll expect a file called "maildump" in the current directory, and it'll output to a "dump" folder in the current directory
19:25:06: <Doxin> impomatic_: it will probably trip up on the slightest malformed date or whatnot though
19:41:02: <impomatic_> I'm going to eat, then try to get it running :-)
19:41:22: <impomatic_> It's giving me an error in line 54 so I deleted the line!
19:42:09: <impomatic_> But the one I'm struggling with is ImportError: No module named 'cStringIO'
19:42:16: <Doxin> oh
19:42:20: <Doxin> impomatic_: you installed python 3
19:42:29: <impomatic_> I assume it's a path error or something... Going to eat then solve it.
19:42:32: <impomatic_> Oh!
19:42:34: <Doxin> impomatic_: just adding brackets around the stuff after the print should fix it
19:42:46: * Doxin is stubborn and still uses 2.7
19:42:48: <impomatic_> Should I install Python 2.7 :-)
19:43:04: <impomatic_> Okay, I'll just eat, then I'll install 2.7 :-)
19:43:07: <Doxin> yepyep
19:43:24: <impomatic_> At least I'll be able to test pyCorewar under both versions.
19:43:50: <Doxin> hmm
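The `ImportError` above comes from `cStringIO` existing only in Python 2. A common way to let one script run under both versions is a fallback import, sketched here; the helper `render` is hypothetical and only shows the buffer in use. The print error is the separate fix Doxin mentions: `print` is a function in Python 3, so its arguments need parentheses.

```python
try:
    from cStringIO import StringIO  # Python 2 only
except ImportError:
    from io import StringIO  # Python 3

def render(lines):
    """Join lines into one string through an in-memory text buffer."""
    buf = StringIO()
    for line in lines:
        buf.write(line + "\n")
    return buf.getvalue()

# Parenthesised single-argument print works under both 2.7 and 3.x.
print(render(["Subject: test", "Lines: 1"]))
```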
20:00:24: <impomatic_> Python 2.7 doesn't seem to have pip
20:01:46: <Doxin> oh
20:01:48: <Doxin> not by default
20:02:01: <Doxin> impomatic_: you can download the source and just "python setup.py install" most of the time
20:02:18: <Doxin> impomatic_: installing pip might be worth the effort
20:02:36: <Doxin> http://pip.readthedocs.org/en/latest/installing.html
20:48:08: <impomatic_> Back, trying to get it running. :-)
20:48:37: <impomatic_> This bit complains that it's received some non-ascii characters :-)
20:48:38: <impomatic_> g=email.generator.Generator(tfid)
20:48:38: <impomatic_> g.flatten(msg)
21:32:15: <Doxin> back
21:32:20: <Doxin> impomatic_: oh
21:32:21: <Doxin> hum
21:32:29: <Doxin> impomatic_: it might not like being fed non-ascii
21:32:41: <Doxin> impomatic_: weird though, it should be getting unicode data so who cares
21:33:23: <Doxin> impomatic_: not sure how to fix that one.
21:33:58: <impomatic_> It's complaining that it's out of range (128) or something. I can't see how to fix it with Python 2.7.
21:34:14: <impomatic_> Python 3.? has BytesGenerator which might do the trick.
21:34:29: <Doxin> bytegenerator would probably not work
21:34:35: <Doxin> impomatic_: well yeah, I guess the email parser expects ascii only
21:34:40: <Doxin> impomatic_: I just dont know /why/
21:34:52: <Doxin> impomatic_: either way, converting to py3 code isn't too much of a hassle
21:35:21: <impomatic_> A few of the emails contain high-ASCII characters which might be the problem.
21:35:38: <impomatic_> At the moment I'm ironing out a few other things.
21:36:05: <Doxin> as far as python is concerned it should be getting unicode strings, in whatever internal representation python likes.
21:36:05: <impomatic_> It chokes on lines beginning "From " in the message body!
21:36:17: <Doxin> well yeah
21:36:25: <Doxin> because that's the format your dump uses
21:36:37: <Doxin> your mail client would have the same problem most likely
21:37:21: <impomatic_> There's "lines" in the message header that shows exactly how long the message body is.
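The idea of trusting the `Lines` header could be sketched as follows, assuming the dump separates messages with mbox-style lines beginning `From ` (as Doxin says) and that `Lines` counts body lines. The function name `split_messages` is hypothetical. Once the headers end, the declared number of body lines is copied verbatim, so a body line starting with `From ` no longer triggers a split. If `Lines` is missing or too small, the sketch falls back to splitting on the next `From ` line and can still be fooled.

```python
def split_messages(lines):
    """Yield each message as a list of lines from an mbox-style dump."""
    current = None      # lines of the message being collected
    in_headers = False  # True until the blank line that ends the headers
    declared = 0        # value of the Lines header, 0 if absent or invalid
    body_left = 0       # body lines still to be copied without inspection

    for line in lines:
        if body_left > 0:
            current.append(line)
            body_left -= 1
        elif line.startswith("From "):
            if current is not None:
                yield current
            current, in_headers, declared = [line], True, 0
        elif current is not None:
            current.append(line)
            if in_headers:
                if line.strip() == "":
                    in_headers = False
                    body_left = declared
                elif line.lower().startswith("lines:"):
                    value = line.split(":", 1)[1].strip()
                    declared = int(value) if value.isdigit() else 0
    if current is not None:
        yield current
```

Because it is a generator over lines, it can be fed an open file object directly and never holds more than one message in memory.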
21:38:04: <Doxin> hmm
21:38:19: <Doxin> impomatic_: I'm afraid I haven't got time right now to go fix it
21:38:49: <impomatic_> Not to worry, thanks for what you've written. :-)
21:39:06: <impomatic_> I don't think I would've managed it in x86 or Forth!
21:40:06: <impomatic_> It seems to be working nicely apart from a few cases of tricky input which I'm working on now.
21:41:26: <Doxin> impomatic_: you seem to be quite the expert on archaic languages :P
21:42:12: <impomatic_> I need to learn a modern language!
21:42:42: <Doxin> python is a great starting point
21:43:03: <Doxin> the official tutorial is great if you already know programming: https://docs.python.org/2/tutorial/
21:45:36: <impomatic_> I think I'll try that as soon as I get chance.
21:47:14: <Doxin> feel free to poke me whenever if you need help with that
21:47:20: <impomatic_> thanks :-)
21:48:06: <impomatic_> I seem to be learning a little bit as I tweak this script! :-D
21:48:44: <Doxin> python tends to be readable enough to just dive into
23:43:22: -!- Doxin has quit (Remote host closed the connection).