探索Java中的ThreadLocalRandom

sofia 3年前 (2023-12-04) Java系列 | 抢沙发 1687

文章评分 0 次，平均分 0.0 ：

[收起] 文章目录

线性同余方法
不再有CAS
删除间接性
高效的内存访问
填充
Contended注解
结论

Java 7引入了ThreadLocalRandom，以提高高竞争环境中的随机数生成吞吐量。

ThreadLocalRandom背后的原理很简单：每个线程都维护自己版本的Random，而不是共享一个全局Random实例。这反过来又减少了争用，从而提高了吞吐量。

既然这是一个如此简单的想法，我们应该能够卷起袖子，以类似的性能实现ThreadLocalRandom之类的东西，对吧？

在我们的第一次尝试中，我们将使用ThreadLocal<Random>这样简单的东西：

// A few annotations
public class RandomBenchmark {

    private final Random random = new Random();
    private final ThreadLocal<Random> simpleThreadLocal = ThreadLocal.withInitial(Random::new);

    @Benchmark
    @BenchmarkMode(Throughput)
    public int regularRandom() {
        return random.nextInt();
    }

    @Benchmark
    @BenchmarkMode(Throughput)
    public int simpleThreadLocal() {
        return simpleThreadLocal.get().nextInt();
    }

    @Benchmark
    @BenchmarkMode(Throughput)
    public int builtinThreadLocal() {
        return ThreadLocalRandom.current().nextInt();
    }

    // omitted
}

在这个基准测试中，我们比较了Random、我们自己的ThreadLocal<Random>和内置的ThreadLocalRandom：

Benchmark                             Mode  Cnt           Score          Error  Units
RandomBenchmark.builtinThreadLocal   thrpt   40  1023676193.004 ± 26617584.814  ops/s
RandomBenchmark.regularRandom        thrpt   40     7487301.035 ±   244268.309  ops/s
RandomBenchmark.simpleThreadLocal    thrpt   40   382674281.696 ± 13197821.344  ops/s

ThreadLocalRandom每秒产生约10亿个随机数，使ThreadLocal<random>和普通随机方法相形见绌。

线性同余方法

到目前为止，目前使用的最流行的随机数生成器是D.H.Lehmer于1949年推出的线性同余伪随机数生成器。

该算法从初始种子值X0开始。然后根据前一个随机变量生成每个随机变量，如下所示：

Xn+1=（aXn+c）mod m

在现实世界中，只使用一个变量来实现这一点更为实际。因此，我们应该在每次生成数字时更新种子值。非常有趣的是，Random类中的next（int）方法负责这一点：

protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}

由于多个线程可能同时更新种子值，因此我们需要某种同步来协调并发访问。在这里，Java在原子学的帮助下使用了一种无锁方法。

基本上，每个线程都试图通过compareAndSet将种子值原子化地更改为新的种子值。如果线程未能执行此操作，它将重试相同的进程，直到能够成功提交更新为止。

当争用较高时，CAS故障的数量会增加。这就是Random在并发环境中性能不佳的主要原因。

不再有CAS

在基于ThreadLocal的实现中，种子值被限制在每个线程中。因此，我们不需要原子或任何其他形式的同步，因为没有共享的可变状态：

public class MyThreadLocalRandom extends Random {

    // omitted
    private static final ThreadLocal<MyThreadLocalRandom> threadLocal = 
        ThreadLocal.withInitial(MyThreadLocalRandom::new);

    private MyThreadLocalRandom() {}

    public static MyThreadLocalRandom current() {
        return threadLocal.get();
    }

    @Override
    protected int next(int bits) {
        seed = (seed * multiplier + addend) & mask;
        return (int) (seed >>> (48 - bits));
    }
}

如果我们再次运行相同的基准：

Benchmark                             Mode  Cnt           Score          Error  Units
RandomBenchmark.builtinThreadLocal   thrpt   40  1023676193.004 ± 26617584.814  ops/s
RandomBenchmark.lockFreeThreadLocal  thrpt   40   695217843.076 ± 17455041.160  ops/s
RandomBenchmark.regularRandom        thrpt   40     7487301.035 ±   244268.309  ops/s
RandomBenchmark.simpleThreadLocal    thrpt   40   382674281.696 ± 13197821.344  ops/s

MyThreadLocalRandom的吞吐量几乎是简单ThreadLocal<Random>的两倍。

compareAndSet提供了原子性和内存排序保证，这在线程受限的上下文中是不需要的。由于这些保证成本高昂且不必要，因此取消它们大大提高了吞吐量。

然而，我们仍然落后于内置的ThreadLocalRandom！

删除间接性

事实证明，每个线程本身不需要一个独特而完整的Random副本。它只需要最新的种子值，就这样。

从Java 8开始，这些值已添加到Thread类本身：

/** The current seed for a ThreadLocalRandom */
@jdk.internal.vm.annotation.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomSecondarySeed;

每个线程实例都在threadLocalRandomSeed变量中维护其当前种子值，而不是使用MyThreadLocalRandomSet之类的方法。结果，ThreadLocalRandom类变成了一个单例：

/** The common ThreadLocalRandom */
static final ThreadLocalRandom instance = new ThreadLocalRandom();

每次我们调用ThreadLocalRandom.current（）时，它都会惰性地初始化threadLocalRandomSeed，然后返回那个信号：

public static ThreadLocalRandom current() {
    if (U.getInt(Thread.currentThread(), PROBE) == 0)
        localInit();
    return instance;
}

哈希表查找

使用MyThreadLocalRandom，对current（）工厂方法的每次调用都会转换为ThreadLocal实例的哈希值计算和底层哈希表中的查找。

相反，使用这种新的Java 8+方法，我们所要做的就是直接读取threadLocalRandomSeed值，然后更新它。

高效的内存访问

为了更新种子值，java.util.courrent.ThreadLocalRandom需要更改java.lang.Thread类中的threadLocalRandomSeed状态。如果我们公开该状态，那么每个人都可能更新threadLocalRandomSeed，这就不是那么好了。

我们可以使用反射来更新非公共状态，但仅仅因为我们可以，并不意味着我们应该！

事实证明，ThreadLocalRandom使用Unsafe.putLong本机方法有效地更新threadLocalRandomSeed状态：

/**
* The seed increment.
*/
private static final long GAMMA = 0x9e3779b97f4a7c15L;
private static final Unsafe U = Unsafe.getUnsafe();
private static final long SEED = U.objectFieldOffset(Thread.class, "threadLocalRandomSeed");

final long nextSeed() {
    Thread t; 
    long r; // read and update per-thread seed
    U.putLong(t = Thread.currentThread(), SEED, r = U.getLong(t, SEED) + GAMMA);

    return r;
}

基本上，putLong方法将r值写入相对于当前线程的某个内存地址。已通过调用另一个本机方法（即Unsafe.objectFieldOffset）计算出的内存偏移量。

与反射相反，所有这些方法都有本机实现，并且非常高效。

伪共享

CPU缓存是根据缓存行工作的。也就是说，缓存行是CPU缓存和主存储器之间的传输单位。

基本上，处理器倾向于将一些其他值与请求的值一起缓存。这种空间局部性优化通常提高了内存访问的吞吐量和延迟。

然而，当两个或多个线程竞争同一缓存行时，多线程可能会产生反作用。

为了更好地理解这一点，我们假设以下变量位于同一缓存行中：

public class Thread implements Runnable {
    private final long tid;
    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;

    // omitted
}

一些线程将tid或线程id用于某种未知目的。现在，如果我们更新线程中的threadLocalRandomSeed来生成一个随机数，那么应该不会发生任何错误，对吧？这听起来并不是什么大不了的事情，因为一些线程正在读取tid，而另一个线程则写入另一个完整的内存位置。

不管我们怎么想，由于所有这些值都在同一个缓存行中，读取线程将遇到缓存未命中。写入程序需要刷新其存储缓冲区。这种现象被称为伪共享，会对我们的多线程应用程序造成性能打击。

填充

为了避免伪共享问题，我们可以在争用的值周围添加一些填充。这样，这些高度争用的值中的每一个都将驻留在它们自己的缓存行上：

public class Thread implements Runnable {
    private final long tid;
    private long p11, p12, p13, p14, p15, p16, p17 = 0; // one 64 bit long + 7 more => 64 Bytes

    long threadLocalRandomSeed;
    private long p21, p22, p23, p24, p25, p26, p27 = 0;

    int threadLocalRandomProbe;
    private long p31, p32, p33, p34, p35, p36, p37 = 0;

    int threadLocalRandomSecondarySeed;
    private long p41, p42, p43, p44, p45, p46, p47 = 0;

    // omitted
}

在大多数现代处理器中，缓存行大小通常为64或128字节。在我的机器中，它是64个字节，所以我在tid声明之后又添加了7个伪长值。

通常，这些threadLocal*变量将在同一个线程中更新。所以我们最好只隔离tid：

public class Thread implements Runnable {
    private final long tid;
    private long p11, p12, p13, p14, p15, p16, p17 = 0;

    long threadLocalRandomSeed;
    int threadLocalRandomProbe;
    int threadLocalRandomSecondarySeed;

    // omitted
}

读取器线程不会遇到缓存丢失，编写器也不需要立即刷新它的存储缓冲区，因为这些局部变量不易失。

Contended注解

jdk.internal.vm.annotation.Contended注释（如果使用Java 8，则为sun.msc.Contented）是JVM隔离注释字段以避免伪共享的提示。因此，现在以下内容应该更有意义：

/** The current seed for a ThreadLocalRandom */
@jdk.internal.vm.annotation.Contended("tlr")
long threadLocalRandomSeed;

/** Probe hash value; nonzero if threadLocalRandomSeed initialized */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomProbe;

/** Secondary seed isolated from public ThreadLocalRandom sequence */
@jdk.internal.vm.annotation.Contended("tlr")
int threadLocalRandomSecondarySeed;

借助ContendedPPaddingWidth调整标志，我们可以控制填充宽度。