Algorithms Competition

The challenge here is to write a very fast and correct implementation of a tertiary random forest. [ Instruction Files ]
Please email solutions to data-science-challenge (AT) tworoads (dot) co (dot) in with a short-name like “mithrandir”.

Motivation:
A majority of data science applications emphasize speed of computation along with accuracy. Anecdotal evidence suggests that non-linear decision tree methods have performed well in a variety of data science applications. One of the challenges of using them in the domain of HFT is implementing them efficiently. On most events, there are some “knee-jerk” reactions where the response function is largely simple and computing the short-term effect of the event is easy. Then there are more complex responses. To model them we need more sophisticated, perhaps non-linear, models that take more time to compute. Using a sophisticated model for an elementary prediction risks being too slow at the task. This problem is very prevalent in finance, where one encounters relationships that hold even over very short durations, such as the ten seconds after an event has occurred. Then there are relationships that do not seem to hold consistently over short durations but show up more often over longer periods such as months and years.

Input:
You can download a collection of data files each of which is a data set of indicator values which have been snapped at regular intervals. The structure of the data file is further explained in README.txt in the instruction files. We have written a wrapper file, process_data.cpp, that reads the data and calls the function OnInputChange on the TertiaryRandomForest class. The two arguments of the function are the index of the input variable that has changed, and the new value. For instance if indicator 5 has changed and the new value of the indicator is -1, process_data.cpp will call OnInputChange ( 5, -1 ) on the TertiaryRandomForest.
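
The calling convention described above can be sketched as follows. This is only a minimal illustration, not the interface defined in the instruction files: the class name TertiaryRandomForest and the OnInputChange ( index, value ) call come from the text, while the constructor, the argument types, and the helper methods InputValue and NumCalls are assumptions made up for this example. The forest evaluation itself is deliberately left out, since its format is specified in the instruction files.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical skeleton. Only the class name and OnInputChange(index, value)
// are taken from the challenge text; all other names and types are assumed.
class TertiaryRandomForest {
 public:
  // Assumption: the number of indicators is known up front.
  explicit TertiaryRandomForest(std::size_t num_inputs)
      : inputs_(num_inputs, 0.0) {}

  // Called by the wrapper each time a single indicator changes.
  // Caches the latest value so that a prediction can later be
  // recomputed from the current state of all indicators.
  void OnInputChange(std::size_t index, double new_value) {
    assert(index < inputs_.size());  // precondition: valid indicator index
    inputs_[index] = new_value;
    ++num_calls_;
  }

  // Latest cached value of one indicator (throws if index is out of range).
  double InputValue(std::size_t index) const { return inputs_.at(index); }

  // Number of OnInputChange calls seen so far.
  std::size_t NumCalls() const { return num_calls_; }

 private:
  std::vector<double> inputs_;
  std::size_t num_calls_ = 0;
};
```

With this sketch, the example from the text corresponds to calling forest.OnInputChange ( 5, -1 ) on a forest constructed with at least six inputs. Because only one indicator changes per call, a fast implementation would probably want to avoid re-evaluating parts of the forest that do not depend on that indicator, but how to do so is left to the participant.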

Output:
To measure correctness, the predicted value is printed after every ‘samplingrate’ function calls. We will compare the output of our benchmark solution with yours, and as long as no prediction differs by more than 1%, we will consider your values to be correct. We allow this margin of error to account for floating point errors in computation and to permit any optimizations that might be possible with approximate computations. In this domain a very small difference in predicted price should not affect the outcome. If that margin of error allows one to reduce latency, the benefit is often more than the cost.
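
Since no reference output is provided, participants may want a small checker to compare two of their own implementations, for example an exact one against an optimized one. The sketch below shows one way to test the 1% criterion. It assumes the difference is measured relative to the magnitude of the benchmark value, which the text does not state explicitly. The function names WithinTolerance and AllWithinTolerance and the small absolute floor used for benchmark values near zero are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if candidate is within rel_tol (default 1%) of benchmark.
// Assumption: the error is relative to |benchmark|. The abs_floor term is a
// made-up safeguard so that a benchmark of exactly 0 does not reject
// candidates that differ only by rounding noise.
bool WithinTolerance(double benchmark, double candidate,
                     double rel_tol = 0.01, double abs_floor = 1e-12) {
  const double diff = std::fabs(candidate - benchmark);
  return diff <= rel_tol * std::fabs(benchmark) + abs_floor;
}

// Checks every sampled prediction; both vectors must have the same length.
bool AllWithinTolerance(const std::vector<double>& benchmark,
                        const std::vector<double>& candidate) {
  assert(benchmark.size() == candidate.size());
  for (std::size_t i = 0; i < benchmark.size(); ++i) {
    if (!WithinTolerance(benchmark[i], candidate[i])) return false;
  }
  return true;
}
```

Under these assumptions, a benchmark prediction of 100.0 would accept a candidate of 100.9 and reject 101.5.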

FAQ:
Q: When is the deadline for submission ?
A: We would like to close the competition on APR-15-2014

Q: What does each of the files in the instruction files mean ?
A: Please look at the README.txt in the instruction files.

Q: How do I get sample data to run process_data on ?
A: This is an extensive collection of data files to test on.

Q: Would you be providing examples of the forest info file ?
A: The file ‘sample_random_forest.txt’ in the instruction files is a valid example. For further testing, please refer to the file RANDOM_FOREST_DESCRIPTION.txt in the instruction files, which should allow you to create more test cases. The file Sample_Random_Forest_Spec.pdf in the instruction files provides a graphical representation of a sample random forest.

Q: Would you be providing an output file to check output with ?
A: No. We expect participants to check the output themselves, based on their understanding of a random forest.

Q: Would I be a good fit in the company if I do well in this competition ?
A: I am sure a person who excels in this competition will be a great fit in the company. That is a part of our motivation.