Why Your Trading Backtest Lied to You – And What It Cost Me to Find Out


By Felix – founder of unCoded, trading since 2016.


I've run a trading bot in live markets since 2020. Before that I backtested strategies for months. The backtest numbers looked good. The live results looked different.

Not catastrophically different. Not "I lost everything" different. But different enough that I had to rebuild the backtesting engine from scratch – twice – before I trusted the numbers it produced.

This is what I learned.


The backtest is not a simulation of reality. It's a simulation of a market that doesn't exist.

Every backtest makes assumptions. Most of them are wrong.

The most common assumption: your order executes at the price you see on the chart. In live trading, this is almost never true. The chart shows you the last traded price. Your order interacts with an order book, competes with other participants, and fills at whatever price the market gives you in that moment – which may be better or worse than what the chart implied.

The difference between assumed fill price and actual fill price is slippage. It's invisible in most backtests. It's unavoidable in production.

The second assumption: your order doesn't affect the market. For small retail positions on liquid pairs like BTC/USDT this is mostly true. For larger positions or thinner pairs, it isn't. If you're placing meaningful size, you move the price against yourself as you fill. Your backtest assumed you bought at $X. Your live bot bought the first part at $X, the next part at $X+0.02%, the next at $X+0.04%. The strategy that worked at $X doesn't work at $X+0.06%.
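To make the effect concrete, here's a minimal sketch of how a single assumed fill diverges from a live order that walks the book. The 0.02% step per split is illustrative, matching the example above, not a measured Binance figure:

```python
# Hypothetical sketch: average fill price when an order walks the book.
# Each successive split fills `step_pct` percent worse than the last.

def average_fill_price(assumed_price: float, splits: int, step_pct: float = 0.02) -> float:
    """Average execution price across `splits` fills, each `step_pct`
    percent worse than the previous one (illustrative impact model)."""
    fills = [assumed_price * (1 + step_pct / 100 * i) for i in range(splits)]
    return sum(fills) / len(fills)

assumed = 100_000.0                       # backtest assumed one fill at $X
actual = average_fill_price(assumed, 4)   # live: fills at X, X+0.02%, X+0.04%, X+0.06%
slippage_pct = (actual - assumed) / assumed * 100   # 0.03% average impact
```

Even this toy model shows the point: the strategy's edge has to clear not just fees but the average impact across all fills, which the single-price backtest never saw.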

The third assumption – and the one that killed more of my early strategies than anything else – is that your stop loss triggers at the price you set.


The candle close problem that nobody talks about

Standard backtesting evaluates strategy logic at candle close. Meaning: the backtest checks conditions at the end of each candle and decides what would have happened.

Here's what that misses.

A candle has four prices: open, high, low, close. A lot happens between the open and the close. Price can spike down to the low, trigger your stop loss, and then recover to close near the open – all within one candle. A candle-close backtest looks at that candle, sees a close near open, and records: no stop triggered, position still open.

In live trading, your stop triggered at the low. The position closed at a loss. The backtest says you didn't lose that trade. The exchange says you did.

This is not a minor rounding error. In a volatile market with tight stops, this discrepancy compounds across hundreds of trades. A backtest built on candle-close evaluation can show a consistently profitable strategy that loses money in production specifically because of stop-loss misses.

The fix: backtest from 1-second base candle data and build higher timeframes from it. If your strategy runs on 1-hour candles, you construct those hour candles from 3,600 one-second candles. Any intracandle event – a stop trigger, a take profit hit, a trailing stop activation – gets caught at the second it actually occurred, not smoothed away at the close.
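A toy sketch of the difference between the two evaluation modes, with hypothetical prices:

```python
# One 1-hour candle that wicks below a stop and recovers,
# represented as (open, high, low, close). All prices hypothetical.
candle = (100.0, 101.0, 95.0, 99.5)
stop_price = 97.0

# Candle-close evaluation: only the closing price is checked.
o, h, l, c = candle
stop_hit_at_close = c <= stop_price     # False -- position "survives"

# Intracandle evaluation: second-level data inside the hour is checked.
# Three illustrative 1-second lows out of the 3,600 in this hour.
second_lows = [100.0, 96.8, 99.0]
stop_hit_intracandle = any(low <= stop_price for low in second_lows)  # True
```

Same hour of price action, two different answers. The candle-close version records an open position; the 1-second version records the stop that an exchange would actually have filled.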

This is what unCoded's backtesting engine does. It's not a performance feature. It's a correctness feature. Every other approach produces numbers you can't trust.


The Sharpe ratio problem that almost everyone gets wrong

Sharpe ratio is the standard risk-adjusted performance metric. Higher is better. A Sharpe of 2.0 is considered good. A Sharpe of 3.0 is excellent.

Here's the problem: most retail backtesting tools calculate it wrong.

Sharpe ratio requires annualization. You take the annualized return and divide by the annualized volatility: the standard deviation of per-period returns multiplied by the square root of the number of periods per year. That last number – the annualization factor – depends entirely on your timeframe.

A strategy running on 1-minute candles has 525,600 periods per year. The annualization factor is √525,600 ≈ 725.

A strategy running on daily candles has 365 periods per year. The annualization factor is √365 ≈ 19.

If you apply the daily annualization factor to a 1-minute strategy, the volatility in the denominator gets scaled by 19 instead of 725 – understated by a factor of 38. Your Sharpe looks 38x higher than it actually is. A strategy with a real Sharpe of 0.3 displays as 11.4. You deploy it thinking it's exceptional. It isn't.
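Here's a minimal sketch of where the 38x distortion comes from. The return and volatility figures are illustrative:

```python
import math

# Illustrative figures for a strategy evaluated on 1-minute returns.
annual_return = 0.15          # 15% annualized return
std_per_minute = 0.00069      # std dev of 1-minute returns

# Correct: annualize 1-minute volatility with the 1-minute factor.
correct_sharpe = annual_return / (std_per_minute * math.sqrt(525_600))

# Wrong: scale 1-minute volatility with the daily factor instead.
wrong_sharpe = annual_return / (std_per_minute * math.sqrt(365))

inflation = wrong_sharpe / correct_sharpe   # sqrt(525_600 / 365) = ~38x
```

The inflation factor is independent of the strategy itself: it's always the square root of the ratio of the two period counts, which is why the error is so uniform – and so easy to miss when every strategy in a tool is inflated by the same amount.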

I've seen this mistake in commercial backtesting tools. I've seen it in open source libraries. I made it myself early on.

The unCoded backtesting engine uses a lookup table of correct bars-per-year values for every supported timeframe from 1 minute to monthly. It's a simple fix once you know the problem exists. The damage it causes before you know is real.


How the major platforms handle backtesting – and where they fall short

Most traders discover backtesting problems through one of the big platforms. So it's worth being specific about what each one actually does.

3Commas offers a backtesting function on paid plans. It evaluates at candle close. There's no documented annualization methodology for Sharpe ratio and the output metrics are limited to basic P&L. For simple DCA or Grid strategies where intracandle events rarely matter, it's functional. For anything with tight stops or high-frequency logic, the candle-close limitation means the numbers are optimistic.

Cryptohopper includes what it calls "backtesting and paper trading." The backtesting component runs on OHLCV candle data – meaning open, high, low, close, volume per candle, no intracandle resolution. Cryptohopper's documentation states it uses historical data to simulate trades but doesn't specify how stop losses are evaluated within candles. In practice, this produces the same issue: stops that would have triggered on wicks are missed.

HaasOnline is the most technically serious of the cloud-hosted platforms. HaasScript gives experienced users real control over strategy logic and the backtesting engine is more sophisticated than most retail options. It's still cloud-hosted (your API keys on their servers) and still subscription-based at prices significantly higher than most alternatives. For developers who want depth without self-hosting, it's the strongest option in the space. The backtesting accuracy is better than 3Commas' or Cryptohopper's – though still not built on tick or 1-second resolution as standard.

Coinrule is designed for no-code simplicity and the backtesting reflects that. It tests rules against historical data but the documentation is minimal on methodology. For beginners testing simple if-then rules it gives directional feedback. For anything requiring precise stop-loss evaluation or performance metrics beyond basic P&L it's insufficient.

Freqtrade is the open-source option worth mentioning. It's Python-based, fully customizable, and the backtesting engine is more honest than most commercial alternatives – it documents its assumptions, accounts for fees, and produces proper performance metrics including Sharpe. The limitation: it requires significant technical setup, Python knowledge, and ongoing maintenance. For developers it's a legitimate choice. For traders who don't code it's not accessible.

TradeSanta and Bitsgap both offer basic backtesting on their grid and DCA strategies. Neither publishes detailed methodology. Both evaluate at candle close.

The pattern across all of them: backtesting is presented as a feature, not a foundation. The goal is to let users test configurations quickly. The goal is not to produce numbers that accurately predict live performance. That gap is what costs traders money.

The distinction unCoded draws is simple: backtest on 1-second base candles so intracandle events are caught, annualize Sharpe correctly per timeframe, and account for fees in the exact asset they're paid in. These aren't advanced features. They're the minimum requirements for a backtest that tells the truth.


The fee problem that compounds silently

Backtests often either ignore fees entirely or apply them as a rough percentage to each trade.

The reality is more complicated.

Binance fees depend on your VIP tier, whether you're maker or taker, which quote asset you're trading, and whether you're paying in BNB. A standard 0.1% per side becomes 0.075% with BNB discount. At VIP1 it drops further. USDC pairs have different fee treatment than USDT pairs for EU users.

None of this sounds significant until you multiply it across a high-frequency strategy running hundreds of trades per week. The difference between 0.1% and 0.075% per side is 0.05% per round trip – across 500 round-trip trades per month, that's 25% of your average position size paid in extra fees. That's not noise.

The deeper problem: fees are charged in different assets. If you're buying on BTC/USDT without the BNB discount, the fee is deducted in BTC (the base asset). To calculate accurate P&L, you need to convert that BTC fee to USDT at the trade price, not at some approximated value. If your backtesting engine doesn't do this correctly, your P&L figures are wrong.

I rebuilt the fee calculation engine in unCoded specifically because the first version was wrong. It handled the quote-asset fee case correctly. It approximated the base-asset fee case. The approximation was close enough that I didn't notice for weeks. When I found it, the cumulative error across thousands of historical trades was significant enough to change how I evaluated several strategies.

The fix: account for fees in the exact asset they're paid in, convert to quote currency at the actual trade price, and apply the correct fee tier for your configuration. If your backtest doesn't do this, the performance numbers include phantom profits.
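A simplified sketch of that accounting for one BTC/USDT round trip. The fee rate and prices are illustrative, and this version converts the base-asset buy fee at the trade price rather than deducting it from the position size, which a full engine would also handle:

```python
# Hypothetical sketch of fee accounting in the asset actually paid,
# under assumed Binance Spot conventions: a buy on BTC/USDT without
# BNB discount pays its fee in BTC (base), a sell pays in USDT (quote).

def trade_pnl_quote(qty: float, buy_price: float, sell_price: float,
                    fee_rate: float = 0.001) -> float:
    """Net P&L in quote currency (USDT) for one round trip.

    The buy fee is charged in base asset (BTC) and converted to quote
    at the buy price; the sell fee is charged directly in quote."""
    buy_fee_btc = qty * fee_rate
    buy_fee_usdt = buy_fee_btc * buy_price        # convert at the trade price
    sell_fee_usdt = qty * sell_price * fee_rate
    gross = qty * (sell_price - buy_price)
    return gross - buy_fee_usdt - sell_fee_usdt

# 0.01 BTC bought at 100,000, sold at 101,000: $10 gross, ~$2.01 in fees.
net = trade_pnl_quote(0.01, 100_000.0, 101_000.0)
```

Note what the approximation would miss: value the BTC fee at anything other than the actual buy price and every trade's P&L carries a small, systematic error that compounds across thousands of trades.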


The order minimum problem that only shows up in production

Binance has a minimum order value of approximately $5 for most tokens; for some it's higher.

A backtest doesn't know this. It will execute a $3.50 order without complaint. In production, Binance rejects it.

If your strategy uses buy splits – dividing a position into multiple parts – and your investment amount divided by your split count produces any split below the minimum, those splits simply don't execute in live trading. Your backtest assumed full execution. Your live bot executed partially. The position is now sized differently than the model assumed, the split-level P&L calculations are off, and the strategy behaves unpredictably.

The unCoded dashboard has a real-time warning for this: if your configured investment amount and split count would produce any split below $6 (we use $6 to give headroom above the $5 minimum), the dashboard flags it before deployment. It's a two-line calculation. It prevents a specific category of "why is my bot doing something weird" questions that are impossible to diagnose without knowing this edge case exists.
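The check itself is trivial – a hypothetical sketch of the idea, with the $5 minimum and $6 headroom figures taken from the article:

```python
# Pre-deployment sanity check for buy splits. Thresholds follow the
# article; the function name and interface are hypothetical.

MIN_NOTIONAL_USD = 5.0                        # approximate Binance minimum
SAFE_THRESHOLD_USD = MIN_NOTIONAL_USD + 1.0   # headroom above the minimum

def splits_below_minimum(investment_usd: float, split_count: int) -> bool:
    """True if any equal split would fall below the safe threshold."""
    per_split = investment_usd / split_count
    return per_split < SAFE_THRESHOLD_USD

splits_below_minimum(20.0, 4)   # $5.00 per split -> True, flag before deployment
splits_below_minimum(60.0, 4)   # $15.00 per split -> False, safe to deploy
```

Two lines of logic, but only useful if you know the exchange-side rejection exists in the first place.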


What execution quality actually means in practice

All of the above – slippage, intracandle stop triggers, fee accuracy, order minimums – is part of a broader concept: execution quality. The gap between what your model assumed would happen and what actually happened in the market.

In institutional trading, execution quality is a dedicated research area. Transaction cost analysis. Smart order routing. Post-trade analytics comparing actual fills to theoretical benchmarks.

At the retail level, most people don't think about it at all. They look at the backtest equity curve, see a nice upward slope, and deploy.

The strategies that survive production are the ones that were stress-tested against realistic execution assumptions. Not the idealized version where every order fills at the theoretical price with no friction. The realistic version where stops trigger on wicks, fees compound, and minimum order sizes create gaps between intended and actual position sizing.

The backtest is a hypothesis. Production is the test. The more honestly your backtest models reality, the fewer surprises the test produces.


What this means for building a bot you can trust

After rebuilding the backtesting engine twice, here's the mental model I use now.

A backtest should be pessimistic by default. Assume worse fills than your model predicts. Assume fees at the higher end of your actual fee range. Assume stops trigger at the stop price, not at the candle close. If the strategy is still profitable under pessimistic assumptions, it has a chance in production. If it only works under optimistic assumptions, it doesn't really work.

Then test on out-of-sample data. If you optimized parameters on 2022–2023 data, test on 2024 data before deploying. If the strategy falls apart on data it was never trained on, you optimized to history, not to the underlying market dynamic.
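The split itself is simple – a sketch with illustrative dates and per-trade returns:

```python
# Hypothetical in-sample / out-of-sample split by date.
# Records are (trade_date, return) pairs; values are illustrative.
from datetime import date

trades = [
    (date(2022, 3, 1), 0.8), (date(2023, 6, 9), -0.4),
    (date(2024, 1, 5), 0.5), (date(2024, 7, 2), -0.1),
]

cutoff = date(2024, 1, 1)
in_sample = [r for d, r in trades if d < cutoff]        # optimize on this
out_of_sample = [r for d, r in trades if d >= cutoff]   # judge on this
```

The discipline matters more than the code: parameters are only ever tuned on the in-sample slice, and the out-of-sample slice is looked at once, as a verdict, not as another optimization target.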

Then start small. The first two weeks of live trading are when production teaches you what the backtest missed. Small capital means those lessons are cheap.

The goal is not a backtest that looks good. The goal is a backtest that tells the truth.


Felix is the founder of unCoded – a self-hosted, non-custodial Binance Spot trading bot with a backtesting engine built on 1-second base candle data, correct Sharpe annualization per timeframe, and full fee accounting in the actual asset paid. If this article changed how you think about backtesting, the documentation is at uncoded.ch/docs.

uncoded.ch — ArrowTrade AG, Switzerland