With both approaches, you want to show that certain claims hold about the software you are testing. With case-based testing, the claims are implicit (implied by your cases), and the cases are explicit. With property-based testing, the claims are explicit (your property specifications) and the cases are implicit (being some fuzzed subset of the claim's domain).
Property-based testing gives approximate checks of a universal property, i.e. "these assertions hold for all inputs" (the check is a double-negative: are we unable to find a counter-example?)
I've found that many developers dismiss more sophisticated testing/checking approaches as being 'too academic' (e.g. "we don't need formal proofs"), 'too limited', 'impractical for real problems', etc. I like the above characterisation, since it turns that on its head: we can legitimately argue that case-based testing is too limited (it can only express existential properties, when we usually want universal ones) and also too academic (it gives formal proofs, which "we don't need").
Which would be great if it weren't so computationally expensive. Most of the time, it just amounts to generating a subset of random (within-range) input values to test, which decreases its utility.
We need better-structured code, such that each test can be tied to specific code: if the code doesn't change, the test doesn't re-run.
That is way too hard and too risky with current tech, and unit testing doesn't solve the problem.
> Which would be great if it weren't so computationally expensive. Most of the time, it just amounts to generating a subset of random (within-range) input values to test, which decreases its utility.
Those are complaints about the testing/checking process; I was talking about test/property definitions.
In particular, most of the properties we care about are universal ("for all"), but unit tests cannot even state those properties, let alone test them, let alone prove them. Property-based testing lets us state universal properties, and it lets us test them (although it cannot prove them). Those are two advantages compared to unit/functional tests.
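A framework-free sketch of the contrast (the `encode`/`decode` pair here is a hypothetical example, not from the thread): a unit test pins down one example, i.e. an existential claim, while a property states the universal claim directly and then samples the domain to test it.

```python
import random

def encode(s: str) -> bytes:
    return s.encode("utf-8")

def decode(b: bytes) -> str:
    return b.decode("utf-8")

# Unit test: an existential claim -- "there exists an input for which this holds".
assert decode(encode("hello")) == "hello"

# Property: a universal claim -- "for all strings, decode inverts encode".
# We can state it directly, then sample the domain to test (not prove) it.
def roundtrip_property(s: str) -> bool:
    return decode(encode(s)) == s

for _ in range(1000):
    s = "".join(chr(random.randrange(0x20, 0x7F))
                for _ in range(random.randrange(20)))
    assert roundtrip_property(s)
```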
Regarding the actual checking process, and the cost/benefit it provides, that varies depending on the project's needs, the tools/frameworks used (e.g. random generation versus enumeration versus on-demand generation), and the way testing is implemented (e.g. constructive generation versus filtering, domain-specific types versus general-purpose language builtins).
Regardless, the hard part with PBT is generating the inputs so that they cover real cases and don't accidentally, systematically miss some (this is often addressed by choosing different testing boundaries). The hard part with fuzzing is knowing you've done enough to have hit many real data flows.
If we imagine a simplified landscape in which all tests fall on a one-dimensional axis, from "quick unit tests" on the left to "slow integration tests" on the right, then fuzzing is even further off to the right.
Fuzzing is often very slow and operates on an entire black-box program, trying to understand and break it. Access to source code isn't always needed.
My understanding may be incomplete; please correct me if I missed something.
In PBT you generate inputs that exercise branching and value range conditions for a known subset of target variables or other internal states. The targets being tested are explicit.
Or, to quote David on his visit to our office: "Hypothesis is basically a python bytecode fuzzer."
This is especially useful for codebases with a lot of assertions (e.g. to capture relationships that are too awkward to rule out using type-checks)
My take: don't shy away from the logic perspective of what properties are, because it's the simplest explanation. A property is a logical proposition, e.g. "p implies q". Property based testing randomizes values of p and q as a proxy for all possible values, since exhaustive testing is practically impossible in almost all cases. The hypothesis is that running these tests for long enough will eventually expose a bug, since more and more of the input space gets generated over time.
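To make the "p implies q" framing concrete, here is a framework-free sketch (the integer-division claim is an illustrative stand-in): when p is false the implication is vacuously true, which is exactly the case frameworks handle with something like `hypothesis.assume`.

```python
import random

# Property: "p implies q" -- here, a claim about integer division:
# if b != 0 (p), then a == (a // b) * b + a % b (q).
def holds(a: int, b: int) -> bool:
    if b == 0:           # p is false: the implication is vacuously true,
        return True      # so we accept/discard the case (cf. hypothesis.assume)
    return a == (a // b) * b + a % b

# Randomize p and q's free variables as a proxy for all possible values.
for _ in range(10_000):
    a = random.randint(-1000, 1000)
    b = random.randint(-1000, 1000)
    assert holds(a, b)
```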
PBT is an addictive tool. I used it on a personal project involving a lot of date filtering, and it found so many edge cases related to date parsing and timezone handling. In my experience, you also write way less test code, since the test library is generating the test cases for you.
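For a flavor of the date case, a framework-free sketch (the commenter's project isn't shown; this ISO-format round-trip is just an illustrative stand-in for the kind of property that surfaces timezone edge cases):

```python
import random
from datetime import datetime, timedelta, timezone

# Property: parsing an ISO-formatted datetime returns the original value.
def roundtrip(dt: datetime) -> bool:
    return datetime.fromisoformat(dt.isoformat()) == dt

epoch = datetime(2000, 1, 1, tzinfo=timezone.utc)
for _ in range(1000):
    # Random instant, viewed in a random UTC offset (within +/-14 hours).
    offset = timezone(timedelta(minutes=random.randint(-14 * 60, 14 * 60)))
    dt = (epoch + timedelta(seconds=random.randint(0, 10**9))).astimezone(offset)
    assert roundtrip(dt)
```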
The two major downsides are the difficulty in expressing useful properties, and test runtime. Regarding expressing properties, it is difficult, but I wonder if it can't be improved with practice. Hillel Wayne wrote a bit about properties here: https://www.hillelwayne.com/contract-examples/
Regarding the test runtime - it's just something to be wary of. You can easily start running tens of thousands of test cases depending on how you configure the library. It makes things like testing against a real database really impractical, unless you really control which inputs get generated, which defeats the purpose of random data generation.
I think parallelization and non-blocking test suites are the answer here. Locally and in CI, they can be run with a reasonable N so that obvious errors hopefully get caught, but large N's get saved for an async test suite that can benefit from high parallelization.
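Hypothesis supports exactly this split via settings profiles (the profile names and example counts below are illustrative, not prescribed):

```python
import os
from hypothesis import given, settings, strategies as st

# A small profile for local runs and blocking CI, and a large one for an
# async, highly parallel suite; select via an environment variable.
settings.register_profile("dev", max_examples=50)
settings.register_profile("nightly", max_examples=50_000)
settings.load_profile(os.environ.get("HYPOTHESIS_PROFILE", "dev"))

@given(st.lists(st.integers()))
def test_sort_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)

test_sort_idempotent()  # runs max_examples randomized cases
```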
My last point: even with all the benefits of property-based tests, example-based tests are still easier to write and serve their purpose. They also allow you to test specific edge cases explicitly.
Like everything else, PBT is just another tool in the shed.
Whether that's practical or not depends on your tooling and your resources. If you're willing to spend US$100k on a test machine, and some people are, you can have a 4096-node test cluster.
But testing is a tool for reliability to be used in conjunction with so many other tools like, you guessed it, typing.
In 2016 when this article was written, typing wasn't yet popular in Python. But now it is, and indeed more popular than property testing will ever be.
But Hypothesis doesn't support typing very well, though they do attempt it. Just take a look at that list of caveats. It is a hack job in comparison.
Even if you don't generate your Strategies from types, Hypothesis requires you to know whether the type you construct is internally a sum type or a product type. The testing framework shouldn't have to know these details; this is unnecessary complexity.
If you think the QuickCheck strategy of having lawless unnamed instances is bad, there is now a Haskell library where Gens are values, like they are in Hypothesis (where they are called Strategies). It probably became popular only after 2016, which could be why the article doesn't mention it. Hypothesis also supports "registering" strategies so that they are auto-discovered on-demand. It's done at runtime of course, unlike with QuickCheck, where it is type class based, as previously mentioned.
I applaud the article for cementing that property checking can work without typing, but I worry that readers will conclude that Haskell does it wrong, which is a false implication. You don't have to use ad-hoc polymorphism for discovering strategies.
These just seem like the worst kinds of test, highly coupled with implementation details. For the benefit of finding some bugs with edge cases (which may or may not be spurious), I just don't believe it's worth the effort, given that you still need all your normal tests anyway. Maybe I am just unlucky not to work in a more pure, mathsy domain, I dunno.
I applied the following techniques:
(1) Generate random, valid, well-defined programs and see (a) whether they crash the compiler, (b) whether they cause different CL implementations to produce different outputs, and (c) whether, when modified (by the addition of randomly generated optimization directives or valid type declarations), they still produce the same output. This is differential testing, which was used by McKeeman at DEC in the 1990s to test C compilers, and later improved (again, on C compilers) by Yang, Chen, Eide, and Regehr at U. of Utah (csmith, creduce).
Since a running lisp image can generate and compile functions internally (this IS lisp, after all), the testing loop can be very fast. Since 2003 I have run this on and off for many billions of iterations on desktop and laptop machines on various CL implementations, now mainly on SBCL. Most of the test input reduction is handled automatically, which is a big help.
(3) Extensive fuzzing of calls to standard functions in CL, using random generation of input values and random generation of valid type declarations, with the invariant being that the same values should be computed (and the compiler should not fail). This is a specialization of (1), but it was sufficiently different that the bugs it found were not the same.
Examples of tests produced by (1) and (3) over an early period when this was being developed. Each caused a failure in some CL implementation: https://github.com/pfdietz/ansi-test/blob/master/misc/misc.l...
This was one of my favorite bugs found: https://bugs.launchpad.net/sbcl/+bug/1730699
The experience with this sort of testing of compilers (in any language) is that if the compiler (free or commercial) has never been subjected to it, it will immediately find bugs in the compiler.
At the leaf nodes, generate constants or variables. Keep track of the variables that are in scope as you do this (and their types.)
This is a pretty standard approach used in compiler testing. In languages like C you have problems ensuring the program doesn't cause undefined behavior, but Common Lisp is much nicer in that respect -- for example, evaluation order of arguments to functions is defined, and you don't have to worry about integer operations overflowing.
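A sketch of the recursive generation with scope bookkeeping described above (illustrative only; the original work generated Common Lisp forms, and this Python arithmetic-expression generator just shows the shape of the technique):

```python
import random

def gen_expr(scope, depth):
    """Generate a random integer expression using only in-scope variables."""
    if depth == 0 or random.random() < 0.3:
        # Leaf node: a constant, or a variable if any are in scope.
        if scope and random.random() < 0.5:
            return random.choice(scope)
        return str(random.randint(-100, 100))
    op = random.choice(["+", "-", "*"])
    return f"({gen_expr(scope, depth - 1)} {op} {gen_expr(scope, depth - 1)})"

def gen_program(depth=4):
    scope, lines = [], []
    for i in range(random.randint(1, 5)):
        var = f"v{i}"
        lines.append(f"{var} = {gen_expr(scope, depth)}")
        scope.append(var)  # later expressions may reference this variable
    lines.append(f"result = {gen_expr(scope, depth)}")
    return "\n".join(lines)

# Every generated program is valid by construction: it only ever reads
# variables that were assigned earlier, so it should always execute.
env = {}
exec(gen_program(), env)
assert isinstance(env["result"], int)
```

In a differential setting, the same generated program would be fed to multiple implementations, with the invariant that they all compute the same `result`.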
All this has made me appreciate language specifications that really nail down what programs are supposed to do.
The forms generated are constrained if one wants them to be easily reducible while preserving validity. Hypothesis has a nice solution to that problem, which I didn't use, as this was done a decade and a half earlier.
Properties are external assertions that you code up as a test. An easy example is to think about the contract of a stack, the properties that result, and how you could test those properties with arbitrary examples: pushing an arbitrary t of T onto a stack of T results in a stack one deeper; popping a non-empty stack results in a stack one shallower; etc.
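The stack contract above might be checked like this (a framework-free sketch using a plain Python list as the stack under test):

```python
import random

def check_stack_properties(items):
    """Check the stack contract against an arbitrary sequence of pushes."""
    stack = []
    for x in items:
        before = len(stack)
        stack.append(x)                   # push
        assert len(stack) == before + 1   # "one deeper"
    while stack:
        before = len(stack)
        top = stack[-1]
        assert stack.pop() == top         # pop returns the most recent push
        assert len(stack) == before - 1   # "one shallower"

# Exercise the properties with randomly generated examples.
for _ in range(500):
    check_stack_properties(
        [random.randint(-10, 10) for _ in range(random.randrange(20))])
```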
One approach is to make values to represent the "actions" we care about (e.g. "increment", "sendMessage", "delete", etc.), and write tests which take random lists of actions as arguments and execute those actions.
We can calculate the expected result by either using a non-threaded 'oracle', or by generating actions with a known result (e.g. if we always generate equal numbers of "increment" and "decrement" actions, then the result should match the original value).
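A minimal sketch of the "known result" variant (the counter and action names are illustrative): generate equal numbers of "increment" and "decrement" actions in a random order, so the expected final value is known without an oracle.

```python
import random

def run(actions, start=0):
    """Execute a list of actions against a simple counter."""
    value = start
    for a in actions:
        if a == "increment":
            value += 1
        elif a == "decrement":
            value -= 1
    return value

# Balanced action lists: the result must always match the starting value.
for _ in range(200):
    n = random.randrange(50)
    actions = ["increment"] * n + ["decrement"] * n
    random.shuffle(actions)
    assert run(actions) == 0
```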
As an example, I wrote a dynamic binding mechanism based on thread-local storage, which required special handling of thread pools to ensure futures wouldn't see stale values. To test this, I made a tiny domain-specific language (AKA a handful of classes) defining futures/awaiting, dynamic binding/lookup, and string literals/appending. I made two interpreters (AKA recursive functions) for this DSL: one using real futures/threads/dynamic-binding, the other just a straightforward calculation with a couple of stacks. To test the correctness of my dynamic binding mechanism in the face of futures/threading, I simply test that randomly-generated programs in that DSL will give the same result from both interpreters. (I also wrote a few properties to test each interpreter too!)
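An illustrative miniature of that two-interpreter setup (the commenter's DSL also covered dynamic binding; this sketch keeps only literals, appending, and futures, with the "real" interpreter running futures on an actual thread pool):

```python
import random
from concurrent.futures import ThreadPoolExecutor

# A program is a nested tuple:
#   ("lit", s) | ("append", left, right) | ("future", expr)
def gen(depth):
    if depth == 0 or random.random() < 0.3:
        return ("lit", random.choice("abc"))
    if random.random() < 0.3:
        return ("future", gen(depth - 1))
    return ("append", gen(depth - 1), gen(depth - 1))

def eval_simple(prog):
    """Reference interpreter: a straightforward recursive calculation."""
    tag = prog[0]
    if tag == "lit":
        return prog[1]
    if tag == "future":
        return eval_simple(prog[1])  # a future is just its eventual value
    return eval_simple(prog[1]) + eval_simple(prog[2])

def eval_threaded(prog, pool):
    """'Real' interpreter: futures actually run on a thread pool."""
    tag = prog[0]
    if tag == "lit":
        return prog[1]
    if tag == "future":
        return pool.submit(eval_threaded, prog[1], pool).result()
    return eval_threaded(prog[1], pool) + eval_threaded(prog[2], pool)

# Property: both interpreters agree on randomly generated programs.
# (Worker count exceeds the maximum future-nesting depth, avoiding deadlock.)
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(100):
        prog = gen(5)
        assert eval_simple(prog) == eval_threaded(prog, pool)
```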
What's a "normal test"? I tend to only write property-based tests (i.e. tests with parameters, which get run many times with random inputs).
Tests without parameters tend to either be trivial, or involve a bunch of setup that would be better turning into parameters.
yes, the test code is an “oracle”, knowing what the service _should_ be doing. it’s a parallel implementation of the logic, highly geared towards the use-case of testing.
The two most recent (entirely typical) examples I'm thinking of are third-party systems behaving unexpectedly and a "bug in requirements".
I can see that where it might be effective would be in finding obscure edge cases in very complex logical code with simple inputs and outputs - e.g. a DSL parser or a sophisticated pricing engine. Weirdly it seems I do less and less of that kind of thing every year.
What is property based testing? - https://news.ycombinator.com/item?id=14017578 - April 2017 (33 comments)
In praise of property-based testing (2019) - https://news.ycombinator.com/item?id=26680060 - April 2021 (20 comments)
Effective Property-Based Testing - https://news.ycombinator.com/item?id=26017341 - Feb 2021 (34 comments)
Property-Based Testing in Python - https://news.ycombinator.com/item?id=26010998 - Feb 2021 (2 comments)
Teaching and Assessing Property-Based Testing - https://news.ycombinator.com/item?id=25738436 - Jan 2021 (3 comments)
An introduction to property-based testing with QuickCheck - https://news.ycombinator.com/item?id=25457744 - Dec 2020 (3 comments)
In praise of property-based testing - https://news.ycombinator.com/item?id=21880635 - Dec 2019 (9 comments)
Property-Based Testing with PropEr, Erlang, and Elixir - https://news.ycombinator.com/item?id=18894018 - Jan 2019 (8 comments)
Mysteries of Dropbox: Property-Based Testing of a Synchronization Service [pdf] - https://news.ycombinator.com/item?id=10627666 - Nov 2015 (3 comments)
A new approach to property based testing - https://news.ycombinator.com/item?id=10169098 - Sept 2015 (9 comments)
Hypothesis 1.0: A property-based testing library for Python - https://news.ycombinator.com/item?id=9291015 - March 2015 (30 comments)