How to Tell If a Crypto Trading Bot Strategy Actually Works (Or Just Got Lucky in 2023)


By Felix Götz – Co-Founder and CTO of ArrowTrade AG, building unCoded. In crypto trading since 2016.


Before we start: What follows is real backtest data from unCoded's public testing infrastructure, not marketing material. Every number in this article is verifiable at uncoded.ch/backtesting. The point isn't to show how good any specific strategy is – it's to show how to evaluate whether any strategy you're considering actually works, or just happened to work during the period it was tested.


I get this question constantly.

"My bot strategy showed +180% in backtest. Why does it lose money in live trading?"

The answer is almost always the same, and it's not what most retail bot users expect: the strategy probably never worked. It just happened to look profitable during the specific window it was tested in. When you deployed it in different conditions, the underlying weakness became visible.

This is the single most common failure pattern in retail bot trading. People test a strategy on 2023 data, see beautiful results, and assume those results predict anything about 2024 or 2025. They don't.

Here's how to actually tell whether a trading bot strategy works, with the data to prove why most of them don't.


The fundamental question

Before any other evaluation, every strategy needs to answer one question:

"Does this strategy work because of skill, or because the test period happened to be favorable?"

If you can't tell the difference, you're guessing. And guessing with capital deployed on automated systems is the fastest known way to lose money in retail trading.

The good news: there are specific, well-established methods for separating real strategy edge from luck. Most retail bot platforms don't use them, because the methods reveal uncomfortable truths about strategy robustness. The methods exist anyway.


What "out-of-sample testing" actually means

The academic standard for separating skill from luck is called out-of-sample testing. It's been the gold standard in quantitative finance for decades, but almost nobody in retail crypto bot trading uses it correctly.

Here's the basic idea:

In-sample data is the historical period you used to develop, refine, and optimize your strategy. You looked at this data, tweaked parameters until it worked, and arrived at your final configuration.

Out-of-sample data is a separate period you set aside and never looked at during development. After your strategy is finalized, you test it on this fresh data without making any further changes.

If the strategy works on both periods, the edge is more likely real. If it only works on the in-sample period, you almost certainly overfit – tuned the strategy to noise in your training data rather than to actual market patterns.

A 2014 study found that 44% of published trading strategies fail to replicate their success when applied to new data. That number is for academically-published strategies. For retail bot strategies optimized in commercial backtest environments, the failure rate is almost certainly higher.

This is why a strategy that returns 250% in your backtest can produce 5% – or -30% – in live trading. The backtest period was your in-sample data. Live trading is permanent out-of-sample data. The gap between the two is where overfit strategies die.
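The split itself is mechanical; the discipline is in never touching the held-out slice while tuning. A minimal sketch of the idea — all names here are hypothetical, and `optimize`/`backtest` stand in for whatever tooling you actually use:

```python
def split_sample(candles, holdout_fraction=0.3):
    """Split a chronological price series into an in-sample segment
    (used for development and parameter tuning) and an out-of-sample
    tail (inspected exactly once, after the strategy is frozen)."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(candles) * (1 - holdout_fraction))
    return candles[:cut], candles[cut:]

# Hypothetical workflow:
#   in_sample, out_of_sample = split_sample(all_candles)
#   params = optimize(strategy, in_sample)             # iterate freely here
#   report = backtest(strategy, params, out_of_sample) # run once, no retuning
```

The moment you peek at the out-of-sample result and then go back to tweak parameters, that data has become in-sample too.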


What real out-of-sample testing looks like in crypto

Most retail backtest tools let you test one token, one strategy, one time period. That's it.

This is structurally insufficient for evaluating whether a strategy works. To know if a strategy generalizes, you need to test it across:

Multiple time periods. A strategy that works in 2023 needs to also work in 2024 and 2025. If the same configuration produces wildly different results across years, the "working" year was probably regime luck, not strategy edge.

Multiple tokens. A strategy that works on Bitcoin needs to also work on tokens with different liquidity profiles, different volatility characteristics, and different market structures. If it only works on a handful of cherry-picked tokens, it's not a strategy – it's a coincidence.

Multiple market regimes. A strategy that works in trending markets needs to be evaluated against ranging markets and bear markets too. If it only works when conditions are favorable, you'll deploy it expecting one outcome and get a different one when conditions change.

This is the testing methodology that separates real strategies from accidental performance. Almost no retail platform does it properly, because the results expose how fragile most "winning" strategies actually are.


A real example: one strategy, 287 tokens, three years

unCoded publishes its full backtest distribution for the BasicMode strategy across multiple years and the entire Binance Spot market. Same configuration. Same parameters. Same fees. Only the year and the token change.

This is what real out-of-sample testing looks like in practice.

BasicMode in 2023

Tested across 191 token configurations:

  • Profitable runs: 150 out of 191 (78.5%)

  • Average return: +18.08%

  • Median return: +21.52%

  • Best: ORDIUSDT at +93.06%

  • Worst: QUICKUSDT at -99.45%

  • Total profit across all runs: +659,442 USDT

If you'd seen this data in 2023 and concluded "BasicMode works," you would have been correct based on what you saw. The strategy was profitable on the vast majority of tested tokens. Median return was positive. Most outcomes were good.

A marketing department could legitimately market this as a "+18% average return" strategy without lying.

BasicMode in 2025

Same strategy. Same parameters. Tested across 373 token configurations.

  • Profitable runs: 29 out of 373 (7.8%)

  • Average return: -62.33%

  • Median return: -73.83%

  • Best: BCHUSDT at +43.46%

  • Worst: USUALUSDT at -97.03%

  • Total profit across all runs: +598,738 USDT (despite negative percentage returns – more on this below)

Same strategy. Different year. Profitability rate dropped from 78.5% to 7.8%. Median return went from +21.52% to -73.83%.

This is what a strategy looks like when it's tested honestly across regimes. The strategy didn't change. The market did. And the strategy's apparent profitability turned out to be heavily regime-dependent.

If you'd deployed BasicMode in late 2024 based on the 2023 data, you would have walked into a year where the same strategy lost money on 92% of tokens.

This is the difference between "the strategy works" and "the strategy worked in 2023."

Source for all data: uncoded.ch/backtesting
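Nothing about these headline numbers is exotic — they're plain distribution arithmetic over the per-token results. A sketch of the summary step (the input list in the test is illustrative, not the real dataset):

```python
from statistics import median

def summarize_runs(returns_pct):
    """Summary stats for one strategy/year: share of profitable runs,
    mean return, and median return across all tested tokens."""
    n = len(returns_pct)
    profitable = sum(1 for r in returns_pct if r > 0)
    return {
        "runs": n,
        "profitable_share_pct": round(100 * profitable / n, 1),
        "mean_pct": round(sum(returns_pct) / n, 2),
        "median_pct": round(median(returns_pct), 2),
    }

# The published counts reproduce the rates quoted above:
print(round(100 * 150 / 191, 1))  # 78.5 (2023)
print(round(100 * 29 / 373, 1))   # 7.8  (2025)
```

The point of demanding the full distribution is precisely so you can run this kind of arithmetic yourself instead of trusting a headline.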


Why the 100% Win Rate paradox matters here

If you scan unCoded's backtest data, you'll notice something that looks impossible.

Almost every single run shows a "Win Rate" of 100.0%.

Including the runs that lost 75% of capital. ETHUSDT at -8.44% with 100% win rate. KSMUSDT at -75.58% with 100% win rate. USUALUSDT at -97.03% with 100% win rate.

A new reader's first reaction is reasonable: "100% win rate is impossible, this must be marketing fraud."

It's neither. And understanding why is critical for evaluating any bot strategy you're considering.

The mechanic

In unCoded's strategy structure, positions only close when they reach a positive sellPercentage above their entry. Every closed buy-and-sell cycle, by definition, closes profitably. So every closed trade cycle is a "win" in the literal mathematical sense.

But the strategy doesn't include traditional stop-loss exits in down markets. When price falls below entry, the strategy doesn't sell at a loss. It holds the position and accumulates more at lower prices, waiting for eventual mean reversion.
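A toy simulation makes the paradox concrete. This is not unCoded's actual code — just a sketch of the mechanic, with an accumulate-every-step rule as a deliberate simplification: sell only at a fixed percentage above entry, never at a loss.

```python
def simulate_profit_only_exits(prices, sell_pct=5.0):
    """Toy model: open a position at every price step, and close a
    position only once price exceeds its entry by sell_pct. Positions
    that never reach the target stay open as unrealized losses."""
    open_entries, closed_wins = [], 0
    for price in prices:
        still_open = []
        for entry in open_entries:
            if price >= entry * (1 + sell_pct / 100):
                closed_wins += 1  # every close is, by construction, a win
            else:
                still_open.append(entry)
        open_entries = still_open + [price]
    # "Win rate" over closed trades is 100% whenever anything closed at
    # all, no matter how far underwater the open positions sit.
    unrealized_pct = [(prices[-1] / e - 1) * 100 for e in open_entries]
    return closed_wins, unrealized_pct

# A small rally followed by a grind down: one closed (winning) trade,
# four open positions underwater. 100% win rate, negative equity.
wins, open_pnl = simulate_profit_only_exits([100, 95, 101, 90, 80])
```

Run on that declining series, the only trade ever closed is a winner, while every remaining position is a paper loss — which is exactly the 100%-win-rate/-75%-return combination in the published tables.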

Why this design choice exists – the academic case

This isn't an arbitrary decision. Multiple academic studies have analyzed whether stop-loss strategies actually improve portfolio outcomes versus buy-and-hold approaches.

Kaminski and Lo (2014) developed the rigorous analytical framework for measuring how stop-losses affect returns. Their central finding: for assets following random walk processes, stop-loss strategies almost always produce lower expected returns than holding. The volatility reduction comes at a measurable performance cost.

Annaert et al. (2009) found that stop-loss strategies produce reduced returns compared to buy-and-hold, with the lower volatility being the only compensation – a tradeoff, not a clear win.

Bochuan Dai et al. (2017, ScienceDirect) is the most directly relevant: "For a simple random walk, the [stop-loss] policy always produces lower expected returns. For an AR(1) process, the policy improves performance in the case of momentum, but hurts performance in the case of mean reversion."

Crypto markets in ranging conditions exhibit strong mean-reversion characteristics. Tokens that drop 30% often retrace within weeks. A traditional stop-loss strategy in these conditions exits at the bottom and misses the recovery.

Where this design works and where it fails

Where it works:

  • Range-bound markets where mean reversion eventually occurs

  • Tokens with sufficient liquidity and survival probability

  • Sufficient capital reserves to keep accumulating without exhausting funds

Where it fails:

  • Sustained trending downmoves where price doesn't mean-revert (real bear markets)

  • Tokens in terminal decline (the strategy keeps accumulating into a zero)

  • Capital exhaustion before reversion happens

  • Low-liquidity tokens where even the bot's own trades push price 2%+ in single executions

The 2025 BasicMode data shows exactly what failure looks like: 100% win rate (every closed cycle profitable) alongside a -62% average and -74% median return (most positions never closed, because mean reversion didn't happen during the test period).

The strategy worked as designed at the trade level. The market regime didn't allow positions to close at profit.

The takeaway for evaluating strategies

When any platform shows you "high win rate" without showing the corresponding return percentage, they're hiding the failure mode. A strategy with 100% win rate and -75% return is mathematically possible, real, and exactly what happened on dozens of tokens in 2025.

Win rate alone tells you nothing useful. Win rate alongside return percentage tells you the full story. Always demand both numbers from any bot platform you're evaluating.


The Alpha column: where backtest honesty gets uncomfortable

Here's another metric most retail backtests hide.

Alpha = strategy return minus buy-and-hold return on the same token over the same period.

It answers the critical question: "Did the bot actually add value, or did it just ride the market without doing anything special?"

A strategy that returns -50% on a token where buy-and-hold returned -75% has +25% alpha. The bot lost meaningful capital, but it lost less than holding would have. That's still useful information, but it's not "the bot was profitable."

Some real examples from the BasicMode 2025 data:

  • GNOUSDT: strategy +37.01%, buy-and-hold -52.08%, alpha +89.09%

    (genuinely profitable AND outperformed massively)

  • AAVEUSDT: strategy -32.82%, buy-and-hold -52.72%, alpha +19.89%

    (the bot lost money, but holders lost more)

  • GASUSDT: strategy -54.08%, buy-and-hold -54.53%, alpha +0.45%

    (the bot did almost nothing better than holding)

  • FILUSDT: strategy -73.65%, buy-and-hold -73.79%, alpha +0.14%

    (basically equal to holding, with extra fees)

This is the metric that separates marketing backtests from honest ones. Marketing shows you the +37% return on GNOUSDT. Honest data shows you the alpha next to it, so you can tell whether the bot earned that return or just got lucky on a token that rallied anyway.

Most backtests on the internet don't show alpha because alpha exposes when a strategy is just market beta in disguise.


⚠️ The five tests every strategy should survive

If you're evaluating any bot strategy – whether you built it yourself or you're considering deploying one from a marketplace – here are the five tests it needs to pass before you commit real capital.

Test 1: Multi-year consistency

Run the strategy across at least three different years that include at least one favorable and one unfavorable market regime. Compare the distribution of results.

Pass: Returns are positive across years, with reasonable variance. Maybe +20% in good years, +5% in bad years.

Fail: Returns are dramatically different between years. +180% in 2023, -65% in 2025. This is regime-dependence masquerading as strategy edge.

Test 2: Multi-token consistency

Run the strategy across all available tokens for the same time period. Look at the distribution of outcomes.

Pass: Strategy is profitable on the majority of tokens, with similar distribution across different liquidity tiers and volatility profiles.

Fail: Strategy is profitable on a small subset of cherry-picked tokens. If it only works on 5 out of 100 tokens, it's not a strategy – it's a coincidence with positive examples.

Test 3: Alpha versus buy-and-hold

For every test result, calculate alpha. Don't accept absolute returns alone.

Pass: Strategy produces meaningful positive alpha across most tokens and time periods. The bot is genuinely adding value beyond what holding would have produced.

Fail: Strategy returns track buy-and-hold returns closely. The "bot performance" is just market performance with extra fees.

Test 4: Realistic execution conditions

Verify that the backtest accounts for actual fees, slippage, intracandle execution, and liquidity constraints.

Pass: Backtest evaluates strategy logic on tick-level or 1-second base candle data, calculates fees in the asset they're actually paid in, and includes slippage modeling.

Fail: Backtest evaluates only on candle close, assumes zero slippage, or calculates fees as a flat percentage without considering fee currency. Live performance will not match.
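The fee-currency point in Test 4 is easy to get wrong. On a typical spot exchange the buy fee is charged in the base asset you receive and the sell fee in the quote asset you receive, which shifts where breakeven actually sits. A sketch under that assumption (the 0.1% fee rate is illustrative):

```python
def cycle_pnl_quote(qty, buy_price, sell_price, fee_rate=0.001):
    """Net quote-currency PnL of one buy-then-sell cycle, with the buy
    fee charged in the base asset received and the sell fee in the
    quote asset received (a common spot-exchange convention)."""
    base_received = qty * (1 - fee_rate)           # buy fee taken in base
    quote_spent = qty * buy_price
    quote_received = base_received * sell_price * (1 - fee_rate)  # sell fee in quote
    return quote_received - quote_spent

def breakeven_move_pct(fee_rate=0.001):
    """Minimum % price rise for a cycle to break even after both fees."""
    return (1 / (1 - fee_rate) ** 2 - 1) * 100
```

With a 0.1% fee on each leg, breakeven sits roughly 0.2% above entry — a backtest that assumes zero fees, or charges a flat percentage in the wrong currency, silently counts every sub-0.2% move as profit.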

Test 5: Failure mode disclosure

Understand exactly when and how the strategy fails. Every strategy has failure conditions. If a platform can't tell you what they are, they don't actually understand their own strategy.

Pass: The platform can describe specific market conditions where the strategy underperforms, and provides data showing how badly it underperforms in those conditions.

Fail: The platform shows you only successful examples and can't articulate when the strategy breaks.
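Tests 1 through 3 lend themselves to automation against any published results table. A sketch — the pass thresholds here are illustrative assumptions, not industry standards, and the input format is hypothetical:

```python
from statistics import median

def evaluate_strategy(results_by_year, min_profitable_share=0.5,
                      min_median_alpha_pct=0.0):
    """Score Tests 1-3 against results_by_year, which maps
    year -> list of (return_pct, alpha_pct) tuples, one per token."""
    per_year = {}
    for year, rows in results_by_year.items():
        returns = [r for r, _ in rows]
        alphas = [a for _, a in rows]
        share = sum(1 for r in returns if r > 0) / len(returns)
        per_year[year] = {
            "profitable_share": round(share, 3),
            "median_alpha_pct": round(median(alphas), 2),
            # Test 2 (token breadth) and Test 3 (alpha) for this year:
            "passes": share >= min_profitable_share
                      and median(alphas) >= min_median_alpha_pct,
        }
    # Test 1 (multi-year consistency): every tested year must pass.
    return per_year, all(v["passes"] for v in per_year.values())
```

Fed the BasicMode distributions above (78.5% of tokens profitable in 2023, 7.8% in 2025), the overall verdict comes back False on the multi-year check — exactly the regime dependence these tests exist to catch.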


Why most strategies don't survive these tests

Run any popular marketplace strategy through these five tests and most of them fail. Here's why.

Most retail backtest tools test single tokens. Test 2 (multi-token consistency) immediately exposes strategies that worked on a specific cherry-picked token but generalize poorly.

Most marketplace strategies were optimized during specific market periods. Test 1 (multi-year consistency) exposes strategies that worked in the 2021 bull market but die in 2025 conditions.

Most retail platforms don't show alpha. Test 3 exposes "strategies" that are just market beta with bot wrapping. The bot didn't make money – the asset did.

Most retail backtests use candle-close evaluation. Test 4 exposes the systematic optimism baked into backtests that miss intracandle events.

Most marketing-driven platforms hide failure modes. Test 5 exposes the gap between marketing claims and operational reality.

A strategy that passes all five tests is rare. A strategy that the platform openly admits fails some of them, but documents exactly which conditions cause the failures, is honest. A strategy with marketing claims of consistent profits across all conditions is almost certainly hiding something.


What to do with this information

If you're evaluating a bot platform or strategy:

Ask for the full distribution, not the highlights. If they can show you the result on AXSUSDT in 2021, ask to see the result on the same strategy across 100 different tokens in 2025. The gap between those two answers tells you everything.

Demand alpha, not just return. Any platform that won't show you alpha is hiding whether their "performance" is actually strategy edge or just market beta.

Insist on multi-year data. A strategy with 2023 results but no 2025 results is a strategy hiding its own failure mode. The platform either doesn't have current data (red flag) or has it and chose not to show you (bigger red flag).

Run the win rate/return cross-check. Any platform showing high win rates without showing the corresponding return percentage is hiding the strategy's structural failure mode.

Test on small capital before scaling. Even a strategy that passes all five tests can fail in unexpected ways during your specific deployment window. Start small. Validate. Scale only after evidence of live performance matching test expectations.

If you're building your own strategy, apply the same tests to your own work. The strategies that survive these tests are rare but valuable. The strategies that don't survive shouldn't be deployed regardless of how impressive their backtest looked.


The honest summary

Most crypto trading bot strategies don't actually work. They worked during the period they were tested in.

The difference between those two statements is the entire problem of retail bot trading. People deploy strategies expecting the test results to predict future performance. The strategies that survive multi-year, multi-token, multi-regime testing are rare. The strategies that don't survive are deployed anyway, and the resulting losses are blamed on "the market" or "bad luck" rather than on the inadequate testing methodology that led to the deployment.

The five tests above aren't original. They're standard quantitative finance practice that's been around for decades. The reason they're rare in retail crypto is that applying them rigorously exposes how few strategies actually have edge.

unCoded publishes its full distribution of backtest results across all available tokens and multiple years – including the years where strategies failed on the majority of tokens – because the profit-sharing pricing model only works if users actually profit. Misleading users into deploying overfit strategies hurts the platform's revenue directly. Other platforms can publish curated highlights because their subscription revenue doesn't depend on user outcomes.

The path to evaluating any bot strategy is the same path the academic literature has documented for thirty years: test out of sample, demand the full distribution, calculate alpha, account for realistic execution, and understand failure modes. Most retail platforms don't do any of these things. The few that do are easy to identify by their willingness to show you their bad years alongside their good ones.

If a platform won't show you the bad years, the strategy probably had bad years and the platform doesn't want you to see them.

That's the answer to "does this strategy actually work?" – usually no, sometimes yes, and the difference is always visible in the data the platform either does or doesn't publish.


Felix Götz is Co-Founder and CTO of ArrowTrade AG, the company behind unCoded — a self-hosted, non-custodial crypto Spot trading bot with profit-sharing pricing. All backtest data referenced in this article is publicly available at uncoded.ch/backtesting. Documentation at uncoded.ch/docs. ArrowTrade AG, Switzerland.