With rapid advances in technology, financial markets are now traded more by machines than by
humans. To trade, machines need access to high-quality historical data, which can span
terabytes of disk space. They also need access to real-time data feeds so that they can react
quickly to market events. In this challenge, we take on the task of designing a scalable
infrastructure that can distribute this data for uses ranging from building trading models,
to real-time automated risk managers, to showing the data to users through
1. Design a scalable, fault-tolerant system that can process and store incoming data,
which may have missing data points.
2. Store the data in a way that makes querying efficient.
You are given a process (MarketDataReader) which writes incoming data at high speed to a local
file in comma-separated values (CSV) format. Each line in the file consists of the following 3
fields, delimited by commas:
1) Security symbol
○ Type – An alphanumeric string less than 8 characters long representing a
○ Each product ticker uniquely corresponds to a security traded in the market and
2) Price
○ Type – A decimal number rounded up to 2 decimal places
○ Represents the last traded price for the product on the stock exchange.
3) Timestamp – The timestamp for the data in milliseconds since UNIX epoch.
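As an illustration of the record format described above, here is a minimal Python sketch that parses one line into a typed record. The names Tick and parse_line are hypothetical, and the field order (symbol, price, timestamp) is assumed to follow the order in which the fields are listed.

```python
from decimal import Decimal
from typing import NamedTuple


class Tick(NamedTuple):
    symbol: str        # security symbol, e.g. "AAPL"
    price: Decimal     # last traded price, kept exact to 2 decimal places
    timestamp_ms: int  # milliseconds since UNIX epoch


def parse_line(line: str) -> Tick:
    """Parse one 'symbol,price,timestamp' line into a Tick.

    Assumes the fields appear in the order they are listed in the brief.
    """
    symbol, price, timestamp = line.strip().split(",")
    # The symbol is an alphanumeric string less than 8 characters long.
    if not (symbol.isalnum() and len(symbol) < 8):
        raise ValueError(f"bad symbol: {symbol!r}")
    return Tick(symbol, Decimal(price), int(timestamp))
```

Decimal is used here instead of float so that prices rounded to 2 decimal places are stored exactly; a real solution might prefer integer cents for speed.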
Your task is to write a process which can efficiently read from the file and store the data in a
data store which is optimized for the queries mentioned below.
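One possible way to read efficiently from a file that is still being written is to remember a byte offset and read only what has been appended since. The sketch below is an assumption about how such a reader could look, not a prescribed design; read_new_lines is a hypothetical helper.

```python
from typing import List, Tuple


def read_new_lines(path: str, offset: int) -> Tuple[List[str], int]:
    """Return the complete lines appended since `offset`, plus the new offset.

    A trailing partial line (the writer is mid-write) is left unread so that
    it is picked up whole on the next call.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    end = chunk.rfind(b"\n") + 1  # 0 when no complete line has arrived yet
    lines = chunk[:end].decode("ascii").splitlines()
    return lines, offset + end
```

If the returned offset is persisted together with the stored data, a restarted reader can resume where it stopped without re-reading or skipping lines, which is one simple route to fault tolerance.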
COMMON QUERY USE CASES:
Stock price data is one of the most important and fundamental data sources for any FinTech
firm. It is the basic unit of data on which most investment strategy models are trained. It is
the basis for building and executing high-frequency trading models, which help us get the best
price while executing trades. It is also essential for sending real-time updates to our clients
about their portfolios and providing a transparent user experience. We will
discuss a subset of queries for each use case for which the data store should provide efficient
For training investment strategy models
– End of day prices for a set of products over a given date range.
For training high frequency trading models
– Given a product and a date range (less than 30 days), return a series of prices at
1-minute intervals.
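As a small worked example of the 1-minute series, the Python sketch below samples a price at every minute boundary, assuming that the price at a timestamp T is the latest price received at or before T. The function name minute_series and the in-memory list of (timestamp_ms, price) pairs are illustrative assumptions; a real solution would read from the chosen data store.

```python
from bisect import bisect_right

MINUTE_MS = 60_000


def minute_series(ticks, start_ms, end_ms):
    """Sample the latest known price at each minute in [start_ms, end_ms].

    `ticks` is a list of (timestamp_ms, price) pairs sorted by timestamp.
    Returns (minute_ms, price) pairs; price is None before the first tick.
    """
    times = [t for t, _ in ticks]
    series = []
    for minute in range(start_ms, end_ms + 1, MINUTE_MS):
        i = bisect_right(times, minute)  # ticks[:i] have timestamp <= minute
        series.append((minute, ticks[i - 1][1] if i else None))
    return series


ticks = [(10_000, 100.00), (70_000, 100.50), (200_000, 101.00)]
print(minute_series(ticks, 0, 240_000))
# [(0, None), (60000, 100.0), (120000, 100.5), (180000, 100.5), (240000, 101.0)]
```

Because the last known price is carried forward, an illiquid product with no ticks in a given minute still gets a data point for that minute.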
For executing high frequency trading models
– Real time data for a given security. An API which can retrieve the latest price for a given
For providing real time updates
– Depending on the different timelines for the graph being plotted, you need to build APIs
to return price data corresponding to different time ranges.
– For example, for a call corresponding to getting data for a 5Y plot, we only need
one data point per week. The price for that data point can be considered as the
average of end of day prices for each day in that week.
– For a call for getting data for a 1Y plot, we only need one data point per day. This
data point will be equal to the average of all prices received for that day for that
– For a one-day plot, we will need data points at one-minute frequency, where the price
will be the average of all prices received in the last one minute.
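The three plot granularities above can be sketched as simple aggregations over raw ticks. The Python below is only an illustration under stated assumptions: ticks are (timestamp_ms, price) pairs for a single product, the end of day price is taken to be the last price received that day, days are split using a fixed UTC-5 offset for EST (no daylight saving handling), and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EST = timezone(timedelta(hours=-5))  # assumption: fixed UTC-5, no DST handling


def _mean(values):
    return sum(values) / len(values)


def minute_averages(ticks):
    """One-day plot: average of all prices received within each minute.

    Keys are the start of each minute in ms; the average covers [key, key + 60s).
    """
    buckets = defaultdict(list)
    for ts_ms, price in ticks:
        buckets[ts_ms // 60_000 * 60_000].append(price)
    return {minute: _mean(p) for minute, p in sorted(buckets.items())}


def daily_stats(ticks):
    """Per day: (average of all prices that day, end of day price)."""
    by_day = defaultdict(list)
    for ts_ms, price in sorted(ticks):
        day = datetime.fromtimestamp(ts_ms / 1000, EST).date()
        by_day[day].append(price)
    return {day: (_mean(p), p[-1]) for day, p in by_day.items()}


def weekly_eod_averages(ticks):
    """5Y plot: average of end of day prices, keyed by (ISO year, ISO week)."""
    by_week = defaultdict(list)
    for day, (_, eod) in daily_stats(ticks).items():
        by_week[day.isocalendar()[:2]].append(eod)
    return {week: _mean(p) for week, p in by_week.items()}
```

In practice these aggregates would more likely be precomputed by background jobs and stored, so that each plot API reads a small number of rows instead of scanning raw ticks.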
CHOICE OF DATA STORE:
You are free to use any data store of your choice. You can process and store redundant data if
that can make the above queries faster.
We expect you to provide an architectural diagram of your data pipeline. The data schema for
your data store should be clearly stated and explained. You should clearly explain your
design choices and the pros and cons of your design. It would help if you also think about the
extensibility and maintainability of your solution, so that it can support newer APIs which need
data for different frequencies/timelines when the need arises.
You will also need to write the processors/background jobs in a language of your choice and
provide programming APIs which implement the API functions required above, so that data
from your proposed data store can be queried directly using those APIs.
You are free to use any programming language to provide the APIs
(preferably Python or C++). The preferred development environment is any Linux
distribution, such as Ubuntu.
Submissions will be judged on the following criteria:
– Efficiency of the APIs listed above in terms of time complexity and network calls, and the
data structures used in the programming APIs to move data to and from the data store.
– Storage used. You are free to store more data as long as it is justifiable.
– Extensibility of the design. As and when the requirement arises, we may need more
variations in the above APIs. Your design should make it easy to support extra queries.
– Code quality. There is nothing more pleasing than good readable code.
– Scalability and Design of your architecture. Fault tolerant behaviour of your architecture.
COMMONLY ASKED QUESTIONS:
Q. What is the frequency of data being written by MarketDataReader?
A. You can assume that data is being written into the file with an average of 4 data points per
security per second.
Q. Is there any assumption I can make about availability of data?
A. No. It is possible for less frequently traded (illiquid) products to have data coming in at a
much lower rate. You can assume that there will not be any corrupt data in terms of format.
Q. How many products are there in our data set?
A. You can assume that we have 100 products in our data set. However, the design should be
scalable enough to support more with little effort.
Q. Can I use cache layers and background processors to improve querying?
A. Of course!
Q. How do I get a MarketDataReader?
A. You can implement your own MarketDataReader as a small script which generates random
data for a given set of 100 products and writes them to a file in the format specified above.
Q. What is the duration of historical data available?
A. You can assume that you already have the output files of MarketDataReader for the last 5 years.
Q. What are the timings for Data feeds?
A. The data comes in only during trading hours, i.e. in this case only between 9:30 AM EST and
4 PM EST on weekdays from Monday to Friday.
Q. Can I use any open source libraries in my submission?
A. Yes, absolutely. In fact, you’re encouraged to use any open-source projects in your
submission if needed.
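The answers above are enough for a rough estimate of data volume. The arithmetic below is a back-of-the-envelope sketch in Python; the 252 trading days per year and the 26 bytes per CSV line are assumptions, not figures from this document.

```python
TICKS_PER_SECOND_PER_SECURITY = 4          # average rate stated above
SECURITIES = 100                           # products in the data set
TRADING_SECONDS_PER_DAY = int(6.5 * 3600)  # 9:30 AM to 4 PM
TRADING_DAYS_PER_YEAR = 252                # assumption: typical trading calendar
YEARS = 5                                  # history available
BYTES_PER_LINE = 26                        # assumption: "AAPL,101.25,1451917800000\n"

ticks_per_day = TICKS_PER_SECOND_PER_SECURITY * SECURITIES * TRADING_SECONDS_PER_DAY
total_ticks = ticks_per_day * TRADING_DAYS_PER_YEAR * YEARS
raw_bytes = total_ticks * BYTES_PER_LINE

print(f"{ticks_per_day:,} ticks per trading day")
print(f"{total_ticks:,} ticks over {YEARS} years")
print(f"{raw_bytes / 1e12:.2f} TB of raw CSV")
```

Under these assumptions this comes to about 9.36 million ticks per trading day and roughly 11.8 billion ticks over five years. Illiquid products will produce fewer ticks, so the figures are closer to an upper bound for 100 products.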
Security – Any tradable financial instrument, like a stock, bond or ETF, which can be traded on a
stock exchange, like Google, Apple or Microsoft stock. Each product is uniquely represented by
a symbol/ticker, like AAPL for Apple, MSFT for Microsoft, etc. For the purpose of this challenge,
you can assume that the symbol remains constant for a given product.
Price – The last traded price of a product. For the purpose of this challenge, you can assume
that price at timestamp T = the latest price received before or at timestamp T.
qplum LLC | https://www.qplum.co | @qplum_team
Engineering Challenge 2016
Investment Strategy – A set of machine learnt rules which define relative allocation within a
predefined set of products. For example: if you have a strategy A which has a set of products,
(X,Y,Z), then based on the market data received till now, a strategy may allocate weights in the
ratio 1:2:3 to X:Y:Z in your portfolio. At qplum, we update our strategies daily with market data
till previous day, get the weights and then use them to maintain positions on our clients’
Execution – Once a strategy gives a set of <product, weight> pairs to allocate in a portfolio,
we need to place those orders at the stock exchange to keep the portfolio in the same
allocation. This is called order execution. We use high-frequency trading algorithms to execute
our orders so that we can predict market prices and detect trends, in order to buy
cheaper or sell at a higher price.
Send your completed queries/submissions to email@example.com, firstname.lastname@example.org