Loading lots of Entities using Doctrine

Dec 23, 2016 · 660 words · 4 minute read php doctrine memory

Every once in a while you run into a situation where you need to load a lot (think thousands, not hundreds) of Doctrine entities from the database. For example when you want to create an export file, or when you want to apply some logic to all of them. The problem when you try this using the default implementation of Doctrine EntityRepository::findAll() is that it will load all entities from the database at once, oftentimes causing an out of memory error.

There is a relatively easy solution to this, using a Generator.

The first step to this solution is creating a custom repository for the entity you want to load. Say for example you have a Customer entity, you can change the annotation of your entity like so:

/**
 * @Entity(repositoryClass="My\Application\Repository\CustomerRepository")
 */
class Customer
{
    // code here ...
}

The same thing is possible with XML as well, if you prefer that.

The second step is implementing the findAll method in the CustomerRepostory class:

namespace My\Application\Repository;

class CustomerRepository extends Doctrine\ORM\EntityRepository
{
    public function findAll()
    {
        $queryBuilder = $this->createQueryBuilder('c')
            ->select('c')
            ->orderBy('c.id');
            
        $limit = 1000;
        $offset = 0;
            
        while (true) {
            $queryBuilder->setFirstResult($offset);
            $queryBuilder->setMaxResults($limit);
            
            $customers = $queryBuilder->getQuery()->getResult();
            
            if (count($customers) === 0) {
                break;
            }
            
            foreach ($customers as $customer) {
                yield $customer;
                $this->_em->detach($customer);
            }
            
            $offset += $limit;
        }
    }
}

with this code written, we can now do:

$repository = $entityManager->getRepository(Customer::class);
foreach ($repository->findAll() as $customer) {
    // do something with the customer
}

This will query the database in batches of 1000 entities at a time, yielding them 1 by 1, and detaching them from the EntityManager once they’ve been yielded, and thus are not needed anymore. When the entity is detached, nothing is referencing them anymore, so the garbage collector will be able to pick them up, freeing memory. Of course if your own code keeps a reference to the entity this will not work, and you will probably run into memory problems anyway.

Once the first of 1000 entities have been yielded, the next 1000 will be queried from the database and so on, until at last a query returns 0 results, indicating there are no more customers in the database and we are done.

Decorating the generator

A bonus here is that Generator implements Iterator, so you can easily decorate it with other iterators. Suppose you have a specification class (as discussed in Unbreakable Domain Models by Mathias Verraes) that says something about a customer, for example if the customer is a premium customer, and this cannot be expressed using SQL, because some external service needs to be called for example.

Now, if you need a list of all these customers, you can simply wrap our Generator in a CallbackFilterIterator, like so:

$repository = $entityManager->getRepository(Customer::class);
$generator = $repository->findAll();
$premiumCustomers = new CallbackFilterIterator(
    $generator,
    function (Customer $customer) {
        return CustomerIsPremiumSpecification::isSatisfiedBy($customer);
    },
);

foreach ($premiumCustomers as $premiumCustomer) {
    // do something with the premium customer
}

Obviously if the specification could be executed in SQL directly that would be a lot faster, but it cases where that’s not possible this is a nice alternative.

A few more things to keep in mind when applying this technique:

  • Make sure to order the query results in a way that is repeatable, otherwise you might get weird results, like getting a customer multiple times, or completely missing a customer.
  • By design a generator can only be iterated once. If you want to loop all customers multiple times you need to call findAll multiple times.
  • You cannot get the number of entities in a Generator using count(), you need to use iterator_get_count() instead. Once you’ve done that you can’t iterate it anymore, and you need to call findAll() again for a new Generator.
  • If your entity has any relations to other entities, and you need to use these other entities as well, you should add a leftJoin() or innerJoin() call to your query builder, so these entities can be eager loaded, resulting in fewer queries overall.
comments powered by Disqus