When reproducing scientific results is a CTF challenge
DCNick3 🦀

So, science. Scientists always talk about the importance of reproducible experiments. Computer Science could be the most reproducible science that ever existed, as reproducing a lot of results is just another program execution away.
Unfortunately, that's not how it works in real life. Most CS papers do not publish their source code, making it quite hard to validate the results. Machine Learning is even worse: training is stochastic, so you won't get two identical models even with the same training process and data.
I am writing my thesis right now and wanted to compare my work with a previously existing one: DeepDi. During the literature review I noticed that they have published a GitHub repository that can be used to run their model. It doesn't contain the full source code and the license is quite restrictive, but it's something, right?
What I didn't notice back then was that running this model required some sort of key...

Well, no problem, right? Just follow the link, obtain the damn key and get measuring! Except, there's nothing there.

Honestly, the whole ordeal of adding DRM to code that is supposed to be used for reproducing an experiment is already pretty weird. Now the service for getting the keys is down, and there is no way to run the code. Quite bad for science, don't you think?
Well, maybe there is a way. This all looks like a typical "crackme" challenge I read about back in the day. Maybe we can get the model working if we look inside.
What's in the box?

The DeepDi repo consists of some docs, Dockerfiles, a license file and, most importantly, the code: DeepDiCore.so, the file with the actual model & runtime inside, and DeepDi.py, a small Python CLI wrapper for it.
It's the API of DeepDiCore.so that wants us to provide the key. It is a Python module and also a native Linux shared library. Let's fire up IDA Pro and look at what's happening in there!

It loaded without problems, and after a quick glance I found a constructor for the class Disassembler which, just like the Python class, takes a string and a boolean: apparently the key and whether to use the GPU.
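For reference, this is roughly how the wrapper would construct it (a sketch: the import name is inferred from the shared library name, and the meaning of the boolean is my guess):

```python
# Hypothetical usage sketch: the Disassembler constructor takes a string
# (the key) and a boolean (presumably "use GPU").
import DeepDiCore

disasm = DeepDiCore.Disassembler("some-key-we-do-not-have", False)
```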

The function is quite large, but the first thing it does is call GetModelFile with the key as a parameter. This is very likely what we want to look at if we want the model to run without the key ;)

It appears to be using Microsoft's C++ REST SDK to make an HTTP GET request with the key & the current Unix timestamp to https://data.mongodb-api.com/app/emailservice-ydzcr/endpoint/deepdi?t=TIME&key=KEY. So, uh, it actually pings back to the mothership with your key every time you use it. Nice stuff, right?
The API returns Invalid request if the time is not close to the current time. Otherwise it says Invalid key and... well, fair enough, I don't have one.
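You can poke at the endpoint yourself with a couple of lines of Python (the key below is obviously a dummy; the URL and parameters are the ones from the disassembly):

```python
# Reproduce the phone-home request by hand (dummy key, real endpoint).
import time
import requests

resp = requests.get(
    "https://data.mongodb-api.com/app/emailservice-ydzcr/endpoint/deepdi",
    params={"t": int(time.time()), "key": "not-a-real-key"},
)
print(resp.text)  # "Invalid key" with a current timestamp, "Invalid request" otherwise
```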
Now let's look at what the code does with the result:

After getting the response from the API, the program extracts a 32-bit integer (!) from it and XORs it with the timestamp that was sent to the server.
The result is then passed to a powerMod function which, as the name suggests, raises the value to the power of 0x10001 modulo 0x8CAC32D7, a pretty common operation in the RSA encryption algorithm. In this case, however, it's not real encryption, just another obscure way to turn one 32-bit value into another.
After running the destructors, the computed value is returned from GetModelFile.
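If I'm reading the disassembly right, the whole computation is equivalent to a couple of lines of Python (the function and argument names here are mine):

```python
# What GetModelFile effectively computes: XOR the server's 32-bit value with
# the timestamp we sent, then run the result through powerMod.
def get_model_key(server_value: int, timestamp: int) -> int:
    x = (server_value ^ timestamp) & 0xFFFFFFFF
    return pow(x, 0x10001, 0x8CAC32D7)  # modular exponentiation, RSA-style
```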
So, the GetModelFile function doesn't actually get us a "file". Maybe it's a key that's later used for some crypto?..

Well... if you can call XOR with a one-byte key crypto...
So, uh, all this HTTP request machinery results in just a single significant byte.
A single-byte key is really trivial to brute-force. If you are persistent, it's even possible to do it manually. However, I am lazy and decided to write a script for that:

(you can also grab it from this gist)
The model is in ONNX format, and such files tend to contain the string pytorch somewhere near the beginning. I decided to use it as a marker of the correct key.
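The script boils down to something like this (a minimal sketch: it assumes the encrypted model blob has already been dumped to a file, and model.enc is a made-up name):

```python
# Brute-force the single-byte XOR key: decrypt the start of the blob with each
# of the 256 candidates and look for the "pytorch" marker that ONNX files
# exported from PyTorch usually carry near the beginning.
from pathlib import Path

encrypted = Path("model.enc").read_bytes()  # hypothetical dump of the model blob
head = encrypted[:4096]                     # the marker should be near the start

for key in range(256):
    if b"pytorch" in bytes(b ^ key for b in head):
        print(f"key found: {key:#04x}")
        break
else:
    print("no key found")
```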
Then it was just a matter of patching the GetModelFile function to always return the correct key, without all that HTTP request nonsense.
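One way to do such a patch is to overwrite the start of GetModelFile with a stub that just returns the constant (the offset and key value below are placeholders; the real ones come from IDA and the brute force above):

```python
# Patch sketch: replace the prologue of GetModelFile with
#   mov eax, <key>   (b8 xx xx xx xx)
#   ret              (c3)
# so it always returns the recovered value and never touches the network.
import struct

GET_MODEL_FILE_OFFSET = 0x0  # placeholder: file offset of GetModelFile
KEY = 0x0                    # placeholder: value recovered by the brute force

with open("DeepDiCore.so", "r+b") as f:
    f.seek(GET_MODEL_FILE_OFFSET)
    f.write(b"\xb8" + struct.pack("<I", KEY) + b"\xc3")
```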
The end?
It's kind of sad that people go out of their way to publish a model for others to reproduce the results AND add DRM that prevents you from running it.
In the end it turned out that their protection was severely flawed (as it usually is with crackmes). It could have been much harder to crack had they used, say, a real cipher like AES. Good for me, I guess.
I never expected to be doing crackme-style reverse engineering to compare my thesis with previous work... At least it's a good excuse to have fun doing the research?