You can easily run the performance test on your own hardware after you’ve cloned the roboquant GitHub repo. All it takes is a single command from the command line:
./mvnw compile exec:java -pl roboquant
When designing and developing roboquant, performance and scalability were key objectives. The design principles that have been followed to ensure the best possible performance are:
Much of the built-in functionality uses multithreading and Kotlin coroutines to speed up processing. For example, the CSVFeed parses CSV files in parallel, so directories with thousands of CSV files can be parsed in seconds. This way modern multicore CPUs can be fully utilized.
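To illustrate the idea of parallel parsing, here is a minimal Java sketch. It is not roboquant's CSVFeed (which is written in Kotlin and uses coroutines); the class name `ParallelCsvSketch`, the two-column `date,close` layout and the in-memory input are all assumptions made for this example.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: parse many small CSV documents in parallel.
// Not roboquant's CSVFeed; names and the "date,close" layout are assumptions.
class ParallelCsvSketch {

    // Parse one CSV document (header line plus "date,close" rows) into closing prices.
    static double[] parseCloses(String csv) {
        return csv.lines()
                .skip(1)                              // skip the header line
                .filter(line -> !line.isBlank())
                .mapToDouble(line -> Double.parseDouble(line.split(",")[1]))
                .toArray();
    }

    // Each document is independent, so a parallel stream can spread the
    // parsing work over the available CPU cores.
    static Map<String, double[]> parseAll(Map<String, String> csvBySymbol) {
        return csvBySymbol.entrySet()
                .parallelStream()
                .collect(Collectors.toConcurrentMap(
                        Map.Entry::getKey,
                        entry -> parseCloses(entry.getValue())));
    }

    public static void main(String[] args) {
        Map<String, String> input = Map.of(
                "AAA", "date,close\n2023-01-02,10.5\n2023-01-03,11.0\n",
                "BBB", "date,close\n2023-01-02,20.0\n");
        Map<String, double[]> closes = parseAll(input);
        System.out.println("AAA rows: " + closes.get("AAA").length);
        System.out.println("BBB rows: " + closes.get("BBB").length);
    }
}
```

In a real feed the input would come from files on disk, but the principle is the same: files that do not depend on each other can be parsed concurrently.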
Roboquant is designed NOT to be a financial ledger, and in general algo-trading applications don’t require one. The native Java Double type provides enough precision for algo-trading purposes, so there is no need to use the much slower and more memory-hungry BigDecimal type. The one area where roboquant does use high-precision calculations is order and position sizes.
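The trade-off can be made concrete with a small Java sketch that adds the same price one million times, once with a primitive `double` and once with `BigDecimal`. The class and method names are hypothetical and only serve this illustration; they are not part of roboquant.

```java
import java.math.BigDecimal;

// Hypothetical sketch comparing double and BigDecimal accumulation.
class DoubleVsBigDecimalSketch {

    // Sum `price` n times using a primitive double (fast, small rounding error).
    static double sumDouble(double price, int n) {
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            total += price;
        }
        return total;
    }

    // Same sum with BigDecimal: exact, but every add creates a new object.
    static BigDecimal sumBigDecimal(String price, int n) {
        BigDecimal p = new BigDecimal(price);
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < n; i++) {
            total = total.add(p);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        double approx = sumDouble(0.1, n);
        BigDecimal exact = sumBigDecimal("0.1", n);
        System.out.println("double:     " + approx);
        System.out.println("BigDecimal: " + exact);
        System.out.println("abs error:  " + Math.abs(approx - exact.doubleValue()));
    }
}
```

The `double` result is not exact, but the accumulated rounding error stays well below one cent, which is typically acceptable when evaluating a strategy rather than keeping books.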
Avoid (auto)boxing, so the JVM can access values directly rather than through wrapper objects. This means using primitive types instead of their wrapper classes where possible. See also Oracle’s article on autoboxing.
Optimized paths for common use-cases, such as trading in a single currency only.
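As a rough illustration of what boxing costs, the following Java sketch computes the same sum with a primitive `long` and with a boxed `Long` accumulator. The class name is hypothetical, and the timings it prints depend heavily on the JVM and its JIT compiler, so treat them as indicative only.

```java
// Hypothetical sketch: primitive versus boxed accumulation.
class BoxingSketch {

    // Primitive accumulator: no wrapper objects involved.
    static long sumPrimitive(int n) {
        long total = 0L;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // Boxed accumulator: each += unboxes, adds and boxes the result again,
    // which may allocate a new Long object per iteration.
    static long sumBoxed(int n) {
        Long total = 0L;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 50_000_000;

        long start = System.nanoTime();
        long a = sumPrimitive(n);
        long primitiveNanos = System.nanoTime() - start;

        start = System.nanoTime();
        long b = sumBoxed(n);
        long boxedNanos = System.nanoTime() - start;

        System.out.println("same result: " + (a == b));
        System.out.println("primitive: " + primitiveNanos / 1_000_000 + " ms");
        System.out.println("boxed:     " + boxedNanos / 1_000_000 + " ms");
    }
}
```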
Reuse objects to allow for faster referential equality comparisons. For example, the common use-case of accessing a Map with an Asset as key benefits a lot from this.
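One way to get this kind of object reuse is to hand out a single shared instance per asset. The Java sketch below assumes a simplified, hypothetical `Asset` type with only a symbol and a currency; it is not roboquant's actual Asset class, and the `of` factory method is invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of instance reuse; this is not roboquant's Asset class.
class AssetInterningSketch {

    static final class Asset {
        final String symbol;
        final String currency;

        private Asset(String symbol, String currency) {
            this.symbol = symbol;
            this.currency = currency;
        }

        @Override
        public boolean equals(Object other) {
            if (this == other) return true;      // fast path: same instance
            if (!(other instanceof Asset)) return false;
            Asset that = (Asset) other;
            return symbol.equals(that.symbol) && currency.equals(that.currency);
        }

        @Override
        public int hashCode() {
            return Objects.hash(symbol, currency);
        }
    }

    private static final Map<String, Asset> CACHE = new ConcurrentHashMap<>();

    // Always return the same instance for the same symbol/currency pair.
    static Asset of(String symbol, String currency) {
        return CACHE.computeIfAbsent(symbol + "/" + currency,
                key -> new Asset(symbol, currency));
    }

    public static void main(String[] args) {
        Map<Asset, Double> prices = new HashMap<>();
        prices.put(of("AAPL", "USD"), 172.5);

        Asset again = of("AAPL", "USD");
        System.out.println("same instance: " + (again == of("AAPL", "USD")));
        System.out.println("price: " + prices.get(again));
    }
}
```

Because the key used for the lookup is the very same object that was stored, the comparison can succeed on the reference check alone and the field-by-field `equals` logic is not needed.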
The Feed API supports data that is kept in-memory as well as data that is stored on disk and only accessed when required. This allows for running back tests where the test data doesn’t fit in memory, or running back tests on machines with limited memory. In fact, roboquant can be used on a JVM with only 200 MB of heap allocated.
Avoid unnecessary copying of objects in order to limit memory allocations and garbage collection. The overall latency is kept to a minimum when processing new market data events and generating the corresponding orders. So it is feasible to create fast, low latency (millisecond) trading strategies.
Optimized collections, so they can be accessed efficiently even as back tests grow in size. For example, open and closed orders are maintained in two separate collections, so access to open orders remains fast even if the total number of orders generated in a back test grows to over 100,000.
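The idea of keeping open and closed orders apart can be sketched as follows. This is a simplified Java illustration with hypothetical names (`OrderBookSketch`, `place`, `close`) in which orders are plain strings; it does not reflect roboquant's actual order handling.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: open and closed orders live in separate collections.
class OrderBookSketch {

    // Open orders by id; this map stays small because orders leave it when closed.
    private final Map<String, String> open = new LinkedHashMap<>();

    // Closed orders only ever grow and are never scanned on the hot path.
    private final List<String> closed = new ArrayList<>();

    void place(String id, String description) {
        open.put(id, description);
    }

    // Move an order from the open to the closed collection.
    // Returns false if the id is not an open order.
    boolean close(String id) {
        String order = open.remove(id);
        if (order == null) {
            return false;
        }
        closed.add(order);
        return true;
    }

    int openCount() {
        return open.size();
    }

    int closedCount() {
        return closed.size();
    }

    public static void main(String[] args) {
        OrderBookSketch book = new OrderBookSketch();
        for (int i = 0; i < 100_000; i++) {
            String id = "order-" + i;
            book.place(id, "BUY 10 AAA");
            if (i % 1000 != 0) {
                book.close(id);      // most orders are closed right away
            }
        }
        System.out.println("open:   " + book.openCount());
        System.out.println("closed: " + book.closedCount());
    }
}
```

Iterating over the open orders only touches the few that are still active, regardless of how many orders have been closed during the run.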
With all these measures in place, roboquant is very fast while still offering the benefits of a high-level language.
In order to test the performance and avoid performance degradation with new releases, a standard performance test is included in the test suite. The remainder of this page provides an overview of that performance test run on different hardware.
The performance test measures the performance of the following 4 scenarios with different parameters (number of assets and events):
Iterate over a feed once and filter for a particular asset.
Run a single back test using a full setup with margin trading, 2 metrics and 3 strategies.
Run multiple back tests sequentially using a bare-minimum configuration.
Run multiple back tests in parallel using a bare-minimum configuration.
It is important to note that:
the performance tests are designed to measure the performance of the back-test engine and not of individual strategies, feeds, metrics or policies. So your performance may differ depending on the components used in your back tests.
the operating systems and JDKs used were not tuned further and run with their out-of-the-box configuration settings.
In order to improve accuracy, each test is run several times to reduce the influence of other activity on the machine. Because of this, running the performance test will take some time to complete.
This is an Apple Silicon (ARM) based laptop with 8 CPU cores and 16GB of memory. It has OpenJDK 19 installed. It is also the hardware used for most of the development of roboquant.
As you can see from the output below, the maximum throughput is 227 million candles per second when running parallel back tests, which is impressive for a laptop. In the sequential runs the laptop does even better: it beats the more powerful and much more expensive server instances by a wide margin.
roboquant
version:   1.5.0-SNAPSHOT
build:     2023-05-10T21:28:28Z
os:        Mac OS X 13.3.1
home:      /Users/peter/.roboquant
jvm:       OpenJDK 64-Bit Server VM 19.0.2
memory:    4096MB
cpu cores: 8

CANDLES  ASSETS  EVENTS  RUNS   FEED    FULL  SEQUENTIAL  PARALLEL  TRADES  CANDLES/S
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     1M      10    1000   100    4ms     9ms        88ms      32ms      1K        31M
     5M      50    1000   100    3ms     6ms        99ms      32ms      5K       156M
    10M      50    2000   100    7ms    11ms       181ms      59ms     10K       169M
    50M     100    5000   100   14ms    55ms       707ms     229ms     50K       218M
   100M     200    5000   100   24ms   102ms      1428ms     439ms    100K       227M
   500M     500   10000   100  120ms   393ms      7496ms    2257ms    500K       221M
  1000M     500   20000   100  243ms   778ms     14737ms    4932ms   1000K       202M
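The relation between the columns is not spelled out on this page, but the numbers are consistent with the following reading: CANDLES equals ASSETS × EVENTS × RUNS, and CANDLES/S equals CANDLES divided by the PARALLEL time. The Java sketch below checks this assumed relation against the 100M row; the class and method names are hypothetical.

```java
// Hypothetical sketch: recompute the CANDLES and CANDLES/S columns
// under the assumption described above.
class ThroughputSketch {

    // Total number of candles processed in one scenario.
    static long candles(int assets, int events, int runs) {
        return (long) assets * events * runs;
    }

    // Candles per second, given the wall-clock time of the parallel run.
    static long candlesPerSecond(long candles, long parallelMillis) {
        return candles * 1000 / parallelMillis;
    }

    public static void main(String[] args) {
        // Values taken from the 100M row of the laptop output.
        long total = candles(200, 5000, 100);
        long perSecond = candlesPerSecond(total, 439);

        System.out.println("candles:   " + total);
        System.out.println("candles/s: " + perSecond / 1_000_000 + "M");
    }
}
```

For this row the computed throughput, expressed in whole millions, matches the reported 227M.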
This is an AMD third generation Epyc processor based instance with 64 vCPU, 128GB memory and OpenJDK 17.
The maximum throughput is 443 million candles per second, which makes it suitable for large back tests. However, given that it has 8x the number of cores compared to the Apple laptop, the performance gain is not that impressive. So it is not the most cost-efficient solution.
roboquant
version:   1.5.0-SNAPSHOT
build:     2023-05-11T06:49:52Z
os:        Linux 5.15.0-1031-aws
home:      /home/ubuntu/.roboquant
jvm:       OpenJDK 64-Bit Server VM 17.0.6
memory:    30688MB
cpu cores: 64

CANDLES  ASSETS  EVENTS  RUNS   FEED    FULL  SEQUENTIAL  PARALLEL  TRADES  CANDLES/S
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     1M      10    1000   100    7ms    19ms       123ms      37ms      1K        27M
     5M      50    1000   100    4ms    12ms       241ms      19ms      5K       263M
    10M      50    2000   100   12ms    17ms       456ms      40ms     10K       250M
    50M     100    5000   100   18ms    85ms      1833ms     149ms     50K       335M
   100M     200    5000   100   34ms   167ms      3481ms     237ms    100K       421M
   500M     500   10000   100  172ms   630ms     21594ms    1128ms    500K       443M
  1000M     500   20000   100  345ms  1387ms     44519ms    2962ms   1000K       337M
This is an ARM based instance (Graviton) with 64 vCPU, 128GB memory and OpenJDK 17. The hourly pricing is slightly below that of the AMD Epyc instance, and it has the same amount of memory and number of vCPUs.
Given the long history of running server JVMs on x86 based hardware, you might expect an ARM instance to underperform. But actually the opposite is true. The maximum throughput is 711 million candles per second, which makes it the best single-instance solution for large parallel back tests.
roboquant
version:   1.5.0-SNAPSHOT
build:     2023-05-11T06:50:00Z
os:        Linux 5.15.0-1031-aws
home:      /home/ubuntu/.roboquant
jvm:       OpenJDK 64-Bit Server VM 17.0.6
memory:    30688MB
cpu cores: 64

CANDLES  ASSETS  EVENTS  RUNS   FEED    FULL  SEQUENTIAL  PARALLEL  TRADES  CANDLES/S
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     1M      10    1000   100    6ms    32ms       267ms      32ms      1K        31M
     5M      50    1000   100    4ms    17ms       241ms      16ms      5K       312M
    10M      50    2000   100   13ms    28ms       503ms      28ms     10K       357M
    50M     100    5000   100   34ms   130ms      2335ms     104ms     50K       480M
   100M     200    5000   100   50ms   213ms      4393ms     166ms    100K       602M
   500M     500   10000   100  256ms   798ms     19787ms     703ms    500K       711M
  1000M     500   20000   100  495ms  1428ms     39755ms    1633ms   1000K       612M
roboquant
version:   1.5.0-SNAPSHOT
build:     2023-05-24T12:04:26Z
os:        Linux 5.19.0-1025-aws
home:      /home/ubuntu/.roboquant
jvm:       OpenJDK 64-Bit Server VM 17.0.7
memory:    3920MB
cpu cores: 8

CANDLES  ASSETS  EVENTS  RUNS   FEED    FULL  SEQUENTIAL  PARALLEL  TRADES  CANDLES/S
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     1M      10    1000   100    8ms    36ms       193ms      42ms      1K        23M
     5M      50    1000   100    5ms    19ms       291ms      65ms      5K        76M
    10M      50    2000   100   15ms    39ms       637ms     116ms     10K        86M
    50M     100    5000   100   37ms   137ms      2936ms     465ms     50K       107M
   100M     200    5000   100   60ms   233ms      5543ms     847ms    100K       118M
   500M     500   10000   100  312ms   816ms     24637ms    3988ms    500K       125M
  1000M     500   20000   100  632ms  1558ms     49457ms    8283ms   1000K       120M
This is the same ARM based instance (Graviton) with 64 vCPU and 128GB memory. But rather than using the OpenJDK that comes with Ubuntu 22.04, the performance tests are run using Oracle GraalVM Enterprise 22.3.
The GraalVM based JDK was installed using the following two commands:
bash <(curl -sL https://get.graalvm.org/ee-token)
bash <(curl -sL https://get.graalvm.org/jdk)
Overall the performance is a bit better than with OpenJDK 17. The maximum throughput is 822 million candles per second when running in parallel. The sequential run performance is also better than with the plain OpenJDK JVM.
roboquant
version:   1.5.0-SNAPSHOT
build:     2023-05-11T07:05:44Z
os:        Linux 5.15.0-1031-aws
home:      /home/ubuntu/.roboquant
jvm:       Java HotSpot(TM) 64-Bit Server VM 17.0.6
memory:    30688MB
cpu cores: 64

CANDLES  ASSETS  EVENTS  RUNS   FEED    FULL  SEQUENTIAL  PARALLEL  TRADES  CANDLES/S
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
     1M      10    1000   100    6ms    40ms       576ms      22ms      1K        45M
     5M      50    1000   100    4ms    20ms       137ms      16ms      5K       312M
    10M      50    2000   100   16ms    26ms       384ms      21ms     10K       476M
    50M     100    5000   100   32ms    92ms      1335ms      85ms     50K       588M
   100M     200    5000   100   58ms   147ms      2512ms     122ms    100K       819M
   500M     500   10000   100  232ms   428ms     12192ms     608ms    500K       822M
  1000M     500   20000   100  556ms   837ms     28386ms    1471ms   1000K       679M