2026-03-08
If you list all device files with major number 1 on a Linux machine, you'll see something like this:
$ ls -la /dev/ | grep ' 1, '
crw-rw-rw- 1 root root 1, 7 Jan 18 20:01 full
crw-r--r-- 1 root root 1, 11 Jan 18 20:01 kmsg
crw-r----- 1 root kmem 1, 1 Jan 18 20:01 mem
crw-rw-rw- 1 root root 1, 3 Jan 18 20:01 null
crw-r----- 1 root kmem 1, 4 Jan 18 20:01 port
crw-rw-rw- 1 root root 1, 8 Jan 18 20:01 random
crw-rw-rw- 1 root root 1, 9 Jan 18 20:01 urandom
crw-rw-rw- 1 root root 1, 5 Jan 18 20:01 zero
Every Linux system has them. You've used /dev/null a thousand times. You've probably read from /dev/urandom or /dev/random. But have you ever looked at what backs these files inside the kernel?
What makes these eight devices special is that they are among the simplest real drivers in Linux. There's no hardware to initialize, no DMA buffers to manage, no firmware to load. Just pure logic: "when userspace calls read(), do this." That makes them the perfect starting point for understanding how any Linux character device driver works.
We'll walk through their implementations line by line. Along the way, you'll see the interface every character device must conform to, the patterns that the kernel uses to dispatch I/O to the right handler, and the design choices that separate a minimal driver from a production-ready one. By the end, you'll have a concrete blueprint for writing your own.
All the code in this post comes from Linux 6.19 (drivers/char/mem.c and drivers/char/random.c). Let's start with open("/dev/null", ...) and trace it all the way down to the two-line C function that makes it work.
Look at the ls output again. Every file shares major number 1. In Linux, the major number identifies which driver handles a device. The minor number (1, 3, 4, 5, 7, 8, 9, 11) identifies which device within that driver.
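You can check the same numbers from a program instead of parsing ls output. The sketch below assumes glibc, where major() and minor() come from <sys/sysmacros.h>; device_numbers() is a helper name made up for this example.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor() on glibc */

/* Hypothetical helper: fills *maj and *min with the device numbers of a
 * device node. Returns 0 on success, -1 if stat() fails or the path is
 * not a character or block device. */
int device_numbers(const char *path, unsigned int *maj, unsigned int *min)
{
    struct stat st;

    assert(path && maj && min);
    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -1;
    *maj = major(st.st_rdev);   /* which driver */
    *min = minor(st.st_rdev);   /* which device within that driver */
    return 0;
}
```

Calling device_numbers("/dev/null", &maj, &min) on a typical Linux system should report major 1 and minor 3, matching the listing above.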
The mapping from minor number to behavior lives in a single array in drivers/char/mem.c:
static const struct memdev {
const char *name;
const struct file_operations *fops;
fmode_t fmode;
umode_t mode;
} devlist[] = {
#ifdef CONFIG_DEVMEM
[DEVMEM_MINOR] = { "mem", &mem_fops, 0, 0 },
#endif
[3] = { "null", &null_fops, FMODE_NOWAIT, 0666 },
#ifdef CONFIG_DEVPORT
[4] = { "port", &port_fops, 0, 0 },
#endif
[5] = { "zero", &zero_fops, FMODE_NOWAIT, 0666 },
[7] = { "full", &full_fops, 0, 0666 },
[8] = { "random", &random_fops, FMODE_NOWAIT, 0666 },
[9] = { "urandom", &urandom_fops, FMODE_NOWAIT, 0666 },
#ifdef CONFIG_PRINTK
[11] = { "kmsg", &kmsg_fops, 0, 0644 },
#endif
};
This is called a designated initializer array. The [3] = ... syntax means "put this entry at index 3." Indices 0, 2, 6, and 10 are left as zero-filled gaps.
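If designated initializers are new to you, here is a small standalone sketch of the same idea. The table and the entry_name() helper are invented for illustration and are not the kernel's devlist; the point is that unnamed slots are zero-filled, so a NULL name marks a gap.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical table for illustration; not the kernel's devlist. */
struct entry {
    const char *name;
    int mode;
};

static const struct entry table[] = {
    [3] = { "null", 0666 },
    [5] = { "zero", 0666 },
    [7] = { "full", 0666 },
};

/* The array length is the highest designated index plus one. */
static_assert(ARRAY_SIZE(table) == 8, "highest index is 7");

/* Returns the name stored at idx, or NULL for out-of-range or gap slots. */
const char *entry_name(size_t idx)
{
    if (idx >= ARRAY_SIZE(table))
        return NULL;
    return table[idx].name;  /* gaps are zero-filled, so name is NULL */
}
```

This is why the kernel code can test a slot with a plain NULL check on a pointer field instead of keeping a separate "is this slot used" flag.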
A few things jump out:
- file_operations *fops is the heart of any character device. It's a struct of function pointers (read, write, mmap, ioctl, ...) that define what happens when userspace performs operations on the file. Each device gets its own set.
- FMODE_NOWAIT on null, zero, random, urandom tells the kernel these devices support non-blocking I/O natively. Their operations never need to wait for hardware.
- mode 0 for mem and port means their permissions are set by udev, not by the driver. Look at the ls output again: both files are owned by root:kmem with mode 0640, while everything else is root:root. The kmem group exists specifically for programs that need to read kernel memory structures (like ps on older Unix systems or dmidecode). By assigning these devices to a dedicated group rather than making them root-only, administrators can grant raw memory access to specific tools without giving them full root privileges.
- #ifdef guards let /dev/mem, /dev/port, and /dev/kmsg be compiled out entirely on hardened kernels.
Key pattern for your own driver: Your driver's behavior is defined entirely by which function pointers you put in your file_operations struct. The kernel handles everything else (VFS lookup, file descriptor management, permission checks) and just calls your functions when userspace does I/O.
memory_open() dispatcher
When userspace calls open("/dev/null", O_RDWR), the VFS resolves the device node, sees major 1, and calls the registered open handler. That handler is memory_open():
static int memory_open(struct inode *inode, struct file *filp)
{
int minor;
const struct memdev *dev;
minor = iminor(inode);
if (minor >= ARRAY_SIZE(devlist))
return -ENXIO;
dev = &devlist[minor];
if (!dev->fops)
return -ENXIO;
filp->f_op = dev->fops;
filp->f_mode |= dev->fmode;
if (dev->fops->open)
return dev->fops->open(inode, filp);
return 0;
}
Read this carefully. This is the classic multiplexer pattern that appears throughout the kernel:
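1. Extract the minor number from the inode.
2. Look up the matching entry in devlist, rejecting out-of-range or empty slots.
3. Replace the file's operations pointer (filp->f_op = dev->fops).
4. If the device has its own open function, call it.
Step 3 is the key insight. After memory_open() returns, the struct file now points to (say) null_fops instead of memory_fops. Every subsequent read(), write(), or close() call goes directly to /dev/null's handlers, without passing through the dispatcher again. The dispatcher runs exactly once, on open.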
Key pattern for your own driver: If you have multiple sub-devices under one major number, use this multiplexer pattern. Register one file_operations whose open swaps in the real per-device operations. The rest of the VFS machinery works automatically.
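To see the pointer swap without any kernel machinery, here is a userspace model of the pattern. Everything in it is hypothetical: struct ops stands in for file_operations, struct handle for struct file, and the two toy "devices" only mimic how /dev/null and /dev/full treat writes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct handle;
struct ops {
    int (*open)(struct handle *h, unsigned int minor);
    long (*write)(struct handle *h, size_t count);
};
struct handle {
    const struct ops *op;
};

static long sink_write(struct handle *h, size_t count) { (void)h; return (long)count; }
static long full_write(struct handle *h, size_t count) { (void)h; (void)count; return -1; }

static const struct ops sink_ops = { .write = sink_write };
static const struct ops full_ops = { .write = full_write };

static const struct ops *const sub_ops[] = {
    [3] = &sink_ops,
    [7] = &full_ops,
};

/* Dispatcher: runs once at open time and swaps in the real ops. */
static int mux_open(struct handle *h, unsigned int minor)
{
    if (minor >= sizeof(sub_ops) / sizeof(sub_ops[0]) || !sub_ops[minor])
        return -1;
    h->op = sub_ops[minor];            /* the swap */
    if (h->op->open)
        return h->op->open(h, minor);  /* optional per-device open */
    return 0;
}

static const struct ops mux_ops = { .open = mux_open };

/* Opens a handle through the dispatcher; returns 0 on success. */
int handle_open(struct handle *h, unsigned int minor)
{
    assert(h != NULL);
    h->op = &mux_ops;
    return h->op->open(h, minor);
}
```

After handle_open(&h, 3) succeeds, h.op->write(&h, 10) goes straight to sink_write and returns 10; the dispatcher is no longer involved.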
At boot, chr_dev_init() registers the major number and creates device nodes:
static int __init chr_dev_init(void)
{
int retval;
int minor;
if (register_chrdev(MEM_MAJOR, "mem", &memory_fops))
printk("unable to get major %d for memory devs\n", MEM_MAJOR);
retval = class_register(&mem_class);
if (retval)
return retval;
for (minor = 1; minor < ARRAY_SIZE(devlist); minor++) {
if (!devlist[minor].name)
continue;
if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
continue;
device_create(&mem_class, NULL, MKDEV(MEM_MAJOR, minor),
NULL, devlist[minor].name);
}
return tty_init();
}
fs_initcall(chr_dev_init);
register_chrdev() claims major number 1 and associates it with memory_fops (the dispatcher). The loop then calls device_create() for each non-empty slot in devlist, which triggers udev to create the /dev/ nodes.
Notice arch_has_dev_port(). On non-x86 architectures where I/O ports don't exist, /dev/port is never created. The tty_init() at the end is a historical artifact: the TTY subsystem was once part of the same driver family.
Now let's look at each device, from simplest to most complex.
/dev/null (minor 3): the universal data sink
This is the single simplest driver in the Linux kernel:
static ssize_t read_null(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
return 0;
}
static ssize_t write_null(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
return count;
}
That's it. read_null() returns 0, which the VFS interprets as end-of-file. write_null() returns count, telling the caller that all bytes were "successfully written," though nothing happened. These are the theoretical minimum implementations of read and write.
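From userspace, those two return values are all you ever observe. A quick sketch, with null_read() and null_write() as helper names made up for this example:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: reads up to len bytes from /dev/null.
 * Expected to return 0 (end-of-file) on Linux, or -1 on error. */
ssize_t null_read(char *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n;
}

/* Hypothetical helper: writes len bytes to /dev/null.
 * Expected to return len, since every byte is "accepted". */
ssize_t null_write(const char *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, buf, len);
    close(fd);
    return n;
}
```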
But /dev/null is more complete than you'd expect. It also supports vectored I/O, splice, and even io_uring:
static ssize_t write_iter_null(struct kiocb *iocb, struct iov_iter *from)
{
size_t count = iov_iter_count(from);
iov_iter_advance(from, count);
return count;
}
static int pipe_to_null(struct pipe_inode_info *info, struct pipe_buffer *buf,
struct splice_desc *sd)
{
return sd->len;
}
static int uring_cmd_null(struct io_uring_cmd *ioucmd, unsigned int issue_flags)
{
return 0;
}
Why does write_iter_null() call iov_iter_advance() instead of just returning the count? Because the VFS contract requires that the iterator position reflects how many bytes were consumed. Without advancing it, callers using scatter-gather I/O would see inconsistent state. Even a "do nothing" driver must uphold the interface contract.
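The userspace counterpart of that iterator path is writev(), which hands the kernel several buffers in one call. A minimal sketch follows; write_two() is a hypothetical helper, and the return value is the total across both buffers.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: writes two strings to path with a single writev()
 * call. Returns the number of bytes the kernel reports as written, or -1. */
ssize_t write_two(const char *path, const char *a, const char *b)
{
    assert(path && a && b);
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
    };
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = writev(fd, iov, 2);  /* scatter-gather write */
    close(fd);
    return n;
}
```

For example, write_two("/dev/null", "hello, ", "world\n") should report 13 bytes, the combined length of both buffers.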
The seek function has a story behind it:
/*
* Special lseek() function for /dev/null and /dev/zero. Most notably, you
* can fopen() both devices with "a" now. This was previously impossible.
* -- SRB.
*/
static loff_t null_lseek(struct file *file, loff_t offset, int orig)
{
return file->f_pos = 0;
}
It always resets the position to zero, regardless of what was requested. The source code comment explains why: this makes fopen("/dev/null", "a") work. Append mode internally performs a seek-to-end, which would fail on a device with no meaningful "end." By always succeeding (and returning 0), this edge case is handled gracefully.
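Both effects are easy to observe from a program. In this sketch, null_seek() and null_append_ok() are hypothetical helper names; the first shows that any seek lands on offset 0, the second that append mode opens cleanly.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: seeks on /dev/null and returns the resulting
 * offset, or -1 on error. */
off_t null_seek(off_t offset, int whence)
{
    assert(whence == SEEK_SET || whence == SEEK_CUR || whence == SEEK_END);
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;
    off_t pos = lseek(fd, offset, whence);
    close(fd);
    return pos;
}

/* Hypothetical helper: returns 1 if /dev/null opens in append mode. */
int null_append_ok(void)
{
    FILE *f = fopen("/dev/null", "a");
    if (!f)
        return 0;
    fclose(f);
    return 1;
}
```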
Here's the full file_operations struct for /dev/null:
static const struct file_operations null_fops = {
.llseek = null_lseek,
.read = read_null,
.write = write_null,
.read_iter = read_iter_null,
.write_iter = write_iter_null,
.splice_write = splice_write_null,
.uring_cmd = uring_cmd_null,
};
Key pattern for your own driver: You don't need to implement every operation in file_operations. Only fill in what your device actually supports. The VFS returns -EINVAL or -ENOTTY for missing operations. But if your device will be used with modern I/O paths (io_uring, splice, readv/writev), you need the _iter variants. The legacy read/write callbacks alone aren't enough.
/dev/zero (minor 5): infinite source of zeroes
Reading from /dev/zero produces an endless stream of null bytes:
static ssize_t read_zero(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
size_t cleared = 0;
while (count) {
size_t chunk = min_t(size_t, count, PAGE_SIZE);
size_t left;
left = clear_user(buf + cleared, chunk);
if (unlikely(left)) {
cleared += (chunk - left);
if (!cleared)
return -EFAULT;
break;
}
cleared += chunk;
count -= chunk;
if (signal_pending(current))
break;
cond_resched();
}
return cleared;
}
This introduces two patterns you'll see in every well-written driver:
Chunked processing. The function works in PAGE_SIZE chunks, not in one giant operation. This bounds the work done between checks and keeps the kernel responsive.
Preemption points. After each chunk, signal_pending(current) checks whether the process has received a signal (like Ctrl-C), and cond_resched() yields the CPU if other tasks are waiting. Without these, a read(fd, buf, 1000000000) call could monopolize the CPU for the entire duration on a kernel built without full preemption. This is critical for any device that can produce unbounded output.
clear_user() writes zeroes directly into the user's address space. It's the architecture-optimized equivalent of memset(buf, 0, chunk) that handles page faults and user-pointer validation internally.
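Because read_zero() may stop early when a signal arrives, a careful caller loops until it has everything it asked for. Here is a sketch of that caller side; fill_from_zero() is a hypothetical helper, not something from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: reads len bytes from /dev/zero into buf, looping
 * on short reads. Returns the number of bytes read, or -1 on error. */
ssize_t fill_from_zero(unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;

    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any progress: retry */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;          /* not expected from /dev/zero, but be safe */
        done += (size_t)n;  /* short read: keep going from where we are */
    }
    close(fd);
    return (ssize_t)done;
}
```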
Writes to /dev/zero are discarded, reusing /dev/null's implementation through preprocessor aliases:
#define write_zero write_null
#define write_iter_zero write_iter_null
#define splice_write_zero splice_write_null
The most historically significant part of /dev/zero is its mmap support:
static int mmap_zero_private_success(const struct vm_area_struct *vma)
{
vma_set_anonymous((struct vm_area_struct *)vma);
return 0;
}
static int mmap_zero_prepare(struct vm_area_desc *desc)
{
#ifndef CONFIG_MMU
return -ENOSYS;
#endif
if (desc->vm_flags & VM_SHARED)
return shmem_zero_setup_desc(desc);
desc->action.success_hook = mmap_zero_private_success;
return 0;
}
A MAP_PRIVATE mmap of /dev/zero creates an anonymous, zero-filled memory mapping. The kernel literally marks the VMA as anonymous via vma_set_anonymous(). A MAP_SHARED mmap creates shared memory backed by tmpfs. Before MAP_ANONYMOUS was added to Linux, this was the portable way to allocate zero-filled pages. You'll still see it in legacy code and in portable programs that target older Unix variants.
Here's how both modes look from userspace:
int fd = open("/dev/zero", O_RDWR);
/* Private mapping: zero-filled anonymous memory (the classic allocation trick) */
char *private = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
/* Shared mapping: memory visible across processes (IPC via tmpfs) */
int *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
The private mapping is equivalent to the modern mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0). The shared mapping provides cross-process communication without touching the filesystem.
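The snippet above skips error handling. A slightly fuller sketch might wrap the open and mmap calls like this; map_zero() is a hypothetical helper name, and the caller is expected to release the mapping with munmap().

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps len bytes of /dev/zero with the given flags
 * (MAP_PRIVATE or MAP_SHARED). Returns the mapping, or NULL on failure. */
void *map_zero(size_t len, int flags)
{
    assert(len > 0);
    int fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

A private mapping obtained this way starts out zero-filled and can be written like ordinary heap memory; the writes never reach the device.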
/dev/full (minor 7): the permanently full disk
This device exists for a single purpose: testing your error-handling code.
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
return -ENOSPC;
}
static const struct file_operations full_fops = {
.llseek = full_lseek,
.read_iter = read_iter_zero,
.write = write_full,
.splice_read = copy_splice_read,
};
That's the entire device: reads return zeroes (reusing read_iter_zero from /dev/zero), and writes fail with ENOSPC ("No space left on device"). If your application writes to /dev/full and crashes or misbehaves, your disk-full error path is broken.
Notice the minimal file_operations: no write_iter, no splice_write, no mmap. A device only needs to implement operations that make sense for it.
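Here is roughly what exercising that error path looks like from a test. The helper name full_write_errno() is made up for this sketch; it returns the errno value so the caller can compare it against ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: tries to write len bytes to /dev/full.
 * Returns 0 if the write succeeded, otherwise the errno value. */
int full_write_errno(const void *buf, size_t len)
{
    assert(buf != NULL);
    int fd = open("/dev/full", O_WRONLY);
    if (fd < 0)
        return errno;
    ssize_t n = write(fd, buf, len);
    int err = (n < 0) ? errno : 0;  /* capture errno before close() */
    close(fd);
    return err;
}
```

One thing to keep in mind when testing code that uses stdio: output is buffered, so the ENOSPC failure often shows up at fflush() or fclose() rather than at the fprintf() call itself.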
/dev/port (minor 4): raw x86 I/O port access
This device maps file offsets directly to x86 I/O port addresses:
static ssize_t read_port(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
unsigned long i = *ppos;
char __user *tmp = buf;
if (!access_ok(buf, count))
return -EFAULT;
while (count-- > 0 && i < 65536) {
if (__put_user(inb(i), tmp) < 0)
return -EFAULT;
i++;
tmp++;
}
*ppos = i;
return tmp-buf;
}
static ssize_t write_port(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
unsigned long i = *ppos;
const char __user *tmp = buf;
if (!access_ok(buf, count))
return -EFAULT;
while (count-- > 0 && i < 65536) {
char c;
if (__get_user(c, tmp)) {
if (tmp > buf)
break;
return -EFAULT;
}
outb(c, i);
i++;
tmp++;
}
*ppos = i;
return tmp-buf;
}
Each byte read triggers an actual inb() instruction, a real I/O bus cycle to the specified port. The i < 65536 bound reflects the x86 I/O port address space (16-bit, ports 0x0000 through 0xFFFF). The entire implementation is wrapped in #ifdef CONFIG_DEVPORT, and creation is further gated by arch_has_dev_port() at boot. It only exists where I/O ports are a real hardware concept.
Like /dev/mem, /dev/port belongs to the kmem group (root:kmem, mode 0640). Reading or writing I/O ports is just as dangerous as accessing physical memory directly: a stray write to the wrong port can hang the machine or corrupt hardware state. Both devices share the same open_port() function, which requires CAP_SYS_RAWIO.
/dev/mem (minor 1): physical memory access
/dev/mem is the most dangerous device in this family. It provides raw read/write access to physical memory. The file position is the physical address:
static ssize_t read_mem(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
{
phys_addr_t p = *ppos;
ssize_t read, sz;
void *ptr;
char *bounce;
int err;
if (p != *ppos)
return 0;
if (!valid_phys_addr_range(p, count))
return -EFAULT;
The p != *ppos check is subtle. *ppos is a 64-bit loff_t, but phys_addr_t may be 32 bits on some architectures. If the cast truncates the value, the comparison fails and the function returns 0 instead of reading from the wrong physical address. A quiet safety net.
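The same round-trip trick works anywhere you narrow an integer. Below is a standalone sketch; narrow_addr_t is a hypothetical 32-bit stand-in for phys_addr_t, and fits_narrow() is a helper invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 32-bit stand-in for phys_addr_t on a 32-bit architecture. */
typedef uint32_t narrow_addr_t;

static_assert(sizeof(narrow_addr_t) < sizeof(int64_t),
              "the example only makes sense for a narrowing conversion");

/* Returns 1 if pos survives the narrowing conversion unchanged, else 0.
 * This mirrors the shape of the p != *ppos check in read_mem(). */
int fits_narrow(int64_t pos)
{
    narrow_addr_t p = (narrow_addr_t)pos;  /* may drop the high bits */
    return (int64_t)p == pos;              /* widen back and compare */
}
```

An offset like 0x1000 passes, while 0x100000000 (one past the 32-bit range) and negative offsets are rejected.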
The core loop processes one page at a time, checking permissions for each page frame:
bounce = kmalloc(PAGE_SIZE, GFP_KERNEL);
if (!bounce)
return -ENOMEM;
while (count > 0) {
unsigned long remaining;
int allowed, probe;
sz = size_inside_page(p, count);
err = -EPERM;
allowed = page_is_allowed(p >> PAGE_SHIFT);
if (!allowed)
goto failed;
err = -EFAULT;
if (allowed == 2) {
/* Show zeros for restricted memory. */
remaining = clear_user(buf, sz);
} else {
ptr = xlate_dev_mem_ptr(p);
if (!ptr)
goto failed;
probe = copy_from_kernel_nofault(bounce, ptr, sz);
unxlate_dev_mem_ptr(p, ptr);
if (probe)
goto failed;
remaining = copy_to_user(buf, bounce, sz);
}
page_is_allowed() returns a three-valued result: 0 = denied, 1 = full access, 2 = restricted (return zeroes instead). This is the CONFIG_STRICT_DEVMEM enforcement point. The "show zeroes" compromise lets tools like dmidecode still function without revealing actual RAM contents.
The bounce buffer is necessary because the physical address might point to memory-mapped I/O regions where direct copy_to_user() can't work. xlate_dev_mem_ptr() handles the architecture-specific virtual mapping, and copy_from_kernel_nofault() catches machine-check exceptions when probing I/O space.
Opening the device requires CAP_SYS_RAWIO and must also pass a kernel lockdown check:
static int open_port(struct inode *inode, struct file *filp)
{
int rc;
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
rc = security_locked_down(LOCKDOWN_DEV_MEM);
if (rc)
return rc;
if (iminor(inode) != DEVMEM_MINOR)
return 0;
filp->f_mapping = iomem_get_mapping();
return 0;
}
When kernel lockdown is active (many distributions enable it automatically under Secure Boot), security_locked_down() blocks access entirely. Even root can't open /dev/mem. The iomem_get_mapping() call sets up a unified address space so the kernel can revoke mappings when a driver takes ownership of an MMIO region.
Key pattern for your own driver: For sensitive devices, layer your access controls. capable() checks coarse capabilities, LSM hooks (security_locked_down()) enforce fine-grained policy, and hardware-level checks (page_is_allowed()) gate per-operation access.
/dev/random and /dev/urandom (minors 8 and 9): the kernel CSPRNG
These are the most complex devices in the family. Their implementations live in a separate file, drivers/char/random.c, with file operations exported for the devlist table:
const struct file_operations random_fops = {
.read_iter = random_read_iter,
.write_iter = random_write_iter,
.poll = random_poll,
.unlocked_ioctl = random_ioctl,
.compat_ioctl = compat_ptr_ioctl,
.fasync = random_fasync,
.llseek = noop_llseek,
.splice_read = copy_splice_read,
.splice_write = iter_file_splice_write,
};
const struct file_operations urandom_fops = {
.read_iter = urandom_read_iter,
.write_iter = random_write_iter,
/* ... same ioctl, poll, fasync as random ... */
};
Both devices share the same write path, ioctl handler, and poll. They differ only in their read path.
The RNG tracks its readiness as a state machine:
static enum {
CRNG_EMPTY = 0, /* Little to no entropy collected */
CRNG_EARLY = 1, /* At least POOL_EARLY_BITS collected */
CRNG_READY = 2 /* Fully initialized with POOL_READY_BITS collected */
} crng_init __read_mostly = CRNG_EMPTY;
The system starts in CRNG_EMPTY. As entropy accumulates from interrupt timing, hardware RNG instructions, and device activity, it transitions through CRNG_EARLY (128 bits) to CRNG_READY (256 bits). The readiness check uses a static branch, a runtime-patched NOP/jump instruction. After initialization, checking readiness has zero overhead on the hot path:
static DEFINE_STATIC_KEY_FALSE(crng_is_ready);
#define crng_ready() (static_branch_likely(&crng_is_ready) || crng_init >= CRNG_READY)
Under the hood, the CRNG uses fast key erasure with ChaCha20 for generation and BLAKE2s for entropy accumulation. Per-CPU keys eliminate lock contention on the hot path. The crypto internals are fascinating but beyond the scope of this post. What matters for the driver is the crng_ready() check, because it's what makes the two read paths behave differently.
/dev/random and /dev/urandom diverge
The entire behavioral difference between the two devices is in their read functions:
/dev/random blocks until the CRNG is fully initialized:
static ssize_t random_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
int ret;
if (!crng_ready() &&
((kiocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) ||
(kiocb->ki_filp->f_flags & O_NONBLOCK)))
return -EAGAIN;
ret = wait_for_random_bytes();
if (ret != 0)
return ret;
return get_random_bytes_user(iter);
}
/dev/urandom serves data immediately but logs a warning if the CRNG isn't ready:
static ssize_t urandom_read_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
static int maxwarn = 10;
if (!crng_ready())
try_to_generate_entropy();
if (!crng_ready()) {
if (!ratelimit_disable && maxwarn <= 0)
ratelimit_state_inc_miss(&urandom_warning);
else if (ratelimit_disable || __ratelimit(&urandom_warning)) {
--maxwarn;
pr_notice("%s: uninitialized urandom read (%zu bytes read)\n",
current->comm, iov_iter_count(iter));
}
}
return get_random_bytes_user(iter);
}
The try_to_generate_entropy() call is a last-ditch effort: the kernel fires a timer on another CPU and measures jitter in cycle counter readings to harvest entropy. The rate-limited warning (capped at 10 messages) avoids flooding the log on headless systems that read /dev/urandom early in boot.
Once the CRNG is initialized, both paths call the same get_random_bytes_user() and produce cryptographically identical output.
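From userspace, the difference only shows up before initialization. A sketch of a reader that can target either device is below; read_random() is a hypothetical helper, and passing a nonzero nonblock asks for O_NONBLOCK, which is the flag that makes random_read_iter() return -EAGAIN instead of waiting.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: reads up to len bytes from path ("/dev/random" or
 * "/dev/urandom"). Returns the byte count, or a negative errno value. */
ssize_t read_random(const char *path, void *buf, size_t len, int nonblock)
{
    assert(path && buf);
    int fd = open(path, O_RDONLY | (nonblock ? O_NONBLOCK : 0));
    if (fd < 0)
        return -errno;
    ssize_t n = read(fd, buf, len);
    int err = errno;        /* save errno before close() can change it */
    close(fd);
    return n < 0 ? -err : n;
}
```

On a booted system with an initialized CRNG, a 32-byte read from /dev/urandom should return 32. A non-blocking read from /dev/random would only be expected to return -EAGAIN very early in boot.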
Writing to either device feeds data into the input pool but does not credit it as entropy:
static ssize_t random_write_iter(struct kiocb *kiocb, struct iov_iter *iter)
{
return write_pool_user(iter);
}
This is a deliberate security decision. If unprivileged writes could credit entropy, an attacker could trick the system into believing it had sufficient randomness when it didn't. Only the RNDADDENTROPY ioctl (which requires CAP_SYS_ADMIN) can write and credit:
case RNDADDENTROPY: {
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
/* ... */
ret = write_pool_user(&iter);
/* ... */
credit_init_bits(ent_count);
return 0;
}
This is how hardware RNG daemons like rngd feed entropy from TPM chips and Intel DRNG into the kernel pool.
/dev/kmsg (minor 11): the kernel ring buffer
The kmsg_fops are defined in kernel/printk/printk.c, not in mem.c. The dispatch table simply references them:
#ifdef CONFIG_PRINTK
[11] = { "kmsg", &kmsg_fops, 0, 0644 },
#endif
The 0644 permission allows any user to read kernel logs (further restricted by the dmesg_restrict sysctl on many distributions), while only root can write. Writing to /dev/kmsg injects messages into the kernel ring buffer, which is how systemd-journald and early userspace logging work.
The older /proc/kmsg interface consumes messages destructively, so only one reader works correctly. /dev/kmsg supports multiple independent readers, each maintaining their own position. This is why systemd prefers it.
Looking across all eight devices, the architectural pattern is clear:
- A single table maps minor numbers to file_operations structs.
- One dispatcher (memory_open) routes the first open() to the right handler and swaps the ops pointer.
- Each device then implements only the behavior it needs, from /dev/null's two-line read/write to /dev/random's ChaCha20 state machine.
If you're writing your own character device driver, this is the blueprint. Define a file_operations struct with your read, write, open, and whatever else you need. Register it with a major/minor number. The kernel does the rest.
The code lives in two files:
- drivers/char/mem.c: the dispatch table and all devices except random
- drivers/char/random.c: the CSPRNG powering /dev/random and /dev/urandom