lesson25.ppt

Our ‘recv1000.c’ driver
Implementing a ‘packet-receive’
capability with the Intel 82573L
network interface controller
Similarities
• There exist quite a few similarities between
implementing the ‘transmit-capability’ and the
‘receive-capability’ in a device-driver for Intel’s
82573L ethernet controller:
– Identical device-discovery and ioremap steps
– Same steps for ‘global reset’ of the hardware
– Comparable data-structure initializations
– Parallel setups for the TX and RX registers
• But there also are a few fundamental differences
(such as ‘active’ versus ‘passive’ roles for driver)
‘push’ versus ‘pull’
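[Diagram: data in the transmit packet-buffer (in host memory) is ‘pushed’ into the Ethernet controller’s transmit-FIFO and sent to the LAN; data arriving from the LAN enters the controller’s receive-FIFO and is ‘pulled’ into the receive packet-buffer (in host memory)]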
The ‘write()’ routine in our ‘xmit1000.c’ driver could transfer data at any time,
but the ‘read()’ routine in our ‘recv1000.c’ driver has to wait for data to arrive.
So to avoid any wasteful busy-waiting, our ‘recv1000.c’ driver can use
the Linux kernel’s sleep/wakeup mechanism – if it enables the NIC’s interrupts!
Sleep/wakeup
• We will need to employ a wait-queue, we
will need to enable device-interrupts, and
we will need to write and install the code
for an interrupt service routine (ISR)
• So our ‘recv1000.c’ driver will have a few
additional code and data components that
were absent in our ‘xmit1000.c’ driver
Driver’s components
my_isr(): This function will awaken any sleeping reader-task
wait_queue_head: where a reader-task sleeps until my_isr() awakens it
my_fops: This ‘struct’ holds one function-pointer (read), which points to my_read()
my_read(): This function will program the actual data-transfer
my_get_info(): This function will allow us to inspect the receive-descriptors
module_init(): This function will detect and configure
the hardware, define page-mappings,
allocate and initialize the descriptors,
install our ISR and enable interrupts,
start the ‘receive’ engine, create the
pseudo-file and register ‘my_fops’
module_exit(): This function will do needed ‘cleanup’
when it’s time to unload our driver –
turn off the ‘receive’ engine, disable
interrupts and remove our ISR, free
memory, delete the page-table entries
and the pseudo-file, and unregister ‘my_fops’
How NIC’s interrupts work
• There are four interrupt-related registers
which are essential for us to understand
Offset   Name   Description
0x00C0   ICR    Interrupt Cause Read
0x00C8   ICS    Interrupt Cause Set
0x00D0   IMS    Interrupt Mask Set/Read
0x00D8   IMC    Interrupt Mask Clear
Interrupt event-types
[Diagram: 32-bit layout of the interrupt event-type bits in the 82573L; the defined bits are listed below]
31: INT_ASSERTED (1=yes, 0=no)
17: ACK (Rx-ACK Frame detected)
16: SRPD (Small Rx-Packet detected)
15: TXD_LOW (Tx-Descr Low Thresh hit)
9: MDAC (MDI/O Access Completed)
7: RXT0 (Receiver Timer expired)
6: RXO (Receiver Overrun)
4: RXDMT0 (Rx-Desc Min Thresh hit)
2: LSC (Link Status Change)
1: TXQE (Transmit Queue Empty)
0: TXDW (Transmit Descriptor Written Back)
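As a rough illustration of the bit-assignments listed above, here is a user-space sketch that gives each event-type bit a name and renders a cause-value as text. The macro names (ICR_TXDW and so on), the table cause_names[] and the helper describe_cause() are hypothetical names chosen for this example; only the bit positions come from the list above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Bit positions from the event-type list; the macro names are hypothetical */
#define ICR_TXDW          (1u << 0)
#define ICR_TXQE          (1u << 1)
#define ICR_LSC           (1u << 2)
#define ICR_RXDMT0        (1u << 4)
#define ICR_RXO           (1u << 6)
#define ICR_RXT0          (1u << 7)
#define ICR_MDAC          (1u << 9)
#define ICR_TXD_LOW       (1u << 15)
#define ICR_SRPD          (1u << 16)
#define ICR_ACK           (1u << 17)
#define ICR_INT_ASSERTED  (1u << 31)

static const struct {
    uint32_t    bit;
    const char *name;
} cause_names[] = {
    { ICR_TXDW, "TXDW" },       { ICR_TXQE, "TXQE" },
    { ICR_LSC, "LSC" },         { ICR_RXDMT0, "RXDMT0" },
    { ICR_RXO, "RXO" },         { ICR_RXT0, "RXT0" },
    { ICR_MDAC, "MDAC" },       { ICR_TXD_LOW, "TXD_LOW" },
    { ICR_SRPD, "SRPD" },       { ICR_ACK, "ACK" },
    { ICR_INT_ASSERTED, "INT_ASSERTED" },
};

/* Writes the names of all bits set in 'cause' (lowest bit first, separated
   by spaces) into 'out'; returns the length of the resulting string. The
   text is truncated if 'out' is too small. */
static size_t describe_cause(uint32_t cause, char *out, size_t outlen)
{
    size_t used = 0;
    size_t i;

    if (outlen == 0)
        return 0;
    out[0] = '\0';
    for (i = 0; i < sizeof cause_names / sizeof cause_names[0]; i++) {
        int n;

        if (!(cause & cause_names[i].bit))
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     used ? " " : "", cause_names[i].name);
        if (n < 0)
            break;
        if ((size_t)n >= outlen - used) {   /* ran out of room */
            used = outlen - 1;
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

A helper along these lines could feed the ‘printk()’ logging that an interrupt-handler performs, but that use is only a suggestion.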
Interrupt Mask Set/Read
• This register is used to enable a selection
of the device’s interrupts which the driver
will be prepared to recognize and handle
• A particular interrupt becomes ‘enabled’ if
software writes a ‘1’ to the corresponding
bit of this Interrupt Mask Set register
• Writing ‘0’ to any register-bit has no effect,
so interrupts can be enabled one-at-a-time
Interrupt Mask Clear
• Your driver can discover which interrupts
have been enabled by reading IMS – but
your driver cannot ‘disable’ any interrupts
by writing to that register
• Instead a specific interrupt can be disabled
by writing a ‘1’ to the corresponding bit in
the Interrupt Mask Clear register
• Writing ‘0’ to a register-bit has no effect on
the network controller’s Interrupt Mask
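The set/clear behavior of these two registers can be sketched with a small user-space model. This is not driver code: the struct nic_mask_model and the functions write_ims(), write_imc() and read_ims() are hypothetical stand-ins for the hardware, written only to show that ‘1’ bits act and ‘0’ bits are ignored.

```c
#include <stdint.h>

/* Hypothetical software model of the NIC's internal Interrupt Mask */
struct nic_mask_model {
    uint32_t mask;
};

/* Models a write to IMS: every '1' bit enables that interrupt,
   every '0' bit leaves the mask unchanged */
static void write_ims(struct nic_mask_model *nic, uint32_t value)
{
    nic->mask |= value;
}

/* Models a write to IMC: every '1' bit disables that interrupt,
   every '0' bit leaves the mask unchanged */
static void write_imc(struct nic_mask_model *nic, uint32_t value)
{
    nic->mask &= ~value;
}

/* Models a read of IMS: reports which interrupts are currently enabled */
static uint32_t read_ims(const struct nic_mask_model *nic)
{
    return nic->mask;
}
```

Because neither write needs a read-modify-write sequence, a driver can enable or disable a single interrupt without knowing the state of the others.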
Interrupt Cause Read
• Whenever interrupts occur, your driver’s
interrupt service routine can discover the specific
conditions that triggered them if it reads the
Interrupt Cause Read register
• In this case your driver can clear any selection of
these bits (except bit #31) by writing ‘1’s to them
(writing ‘0’s to this register will have no effect)
• In case no interrupt has occurred, reading this
register may have the side-effect of clearing it
Interrupt Cause Set
• For testing your driver’s interrupt-handler,
you can artificially trigger any particular
combination of interrupts by writing ‘1’s
into the corresponding register-bits of this
Interrupt Cause Set register (assuming
your combination of bits corresponds to
interrupts that are ‘enabled’ by ‘1’s being
present for them in the Interrupt Mask)
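The interplay of the cause and mask registers can be sketched with a user-space model. The struct nic_irq_model and its functions are hypothetical, and the model makes two simplifying assumptions: a cause bit set through ICS is recorded even when it is masked (it just does not assert an interrupt), and the possible clear-on-read side-effect of ICR is left out.

```c
#include <stdint.h>

#define INT_ASSERTED (1u << 31)   /* bit 31 of the Interrupt Cause Read value */

/* Hypothetical software model of the NIC's interrupt cause and mask state */
struct nic_irq_model {
    uint32_t cause;   /* pending event-type bits */
    uint32_t mask;    /* interrupts enabled via IMS */
};

/* Models a write to ICS: each '1' bit sets the matching cause bit */
static void write_ics(struct nic_irq_model *nic, uint32_t value)
{
    nic->cause |= (value & ~INT_ASSERTED);
}

/* An interrupt is signaled only if some pending cause is also enabled */
static int interrupt_asserted(const struct nic_irq_model *nic)
{
    return (nic->cause & nic->mask) != 0;
}

/* Models a read of ICR (without any clear-on-read side-effect) */
static uint32_t read_icr(const struct nic_irq_model *nic)
{
    uint32_t value = nic->cause;

    if (interrupt_asserted(nic))
        value |= INT_ASSERTED;
    return value;
}

/* Models a write to ICR: each '1' bit clears the matching cause bit,
   and '0' bits have no effect */
static void write_icr(struct nic_irq_model *nic, uint32_t value)
{
    nic->cause &= ~value;
}
```

Reading the cause and then writing the same value back clears exactly the events that were seen, which is the read-then-acknowledge pattern an interrupt service routine would follow.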
Our interrupt-handler
• We decided to enable all possible causes
(and we ‘log’ them via ‘printk()’ messages
we’ve omitted in the code-fragment here):
irqreturn_t my_isr( int irq, void *dev_id )
{
    int intr_cause = ioread32( io + E1000_ICR );
    if ( intr_cause == 0 ) return IRQ_NONE;
    wake_up_interruptible( &wq_rd );
    iowrite32( intr_cause, io + E1000_ICR );
    return IRQ_HANDLED;
}
We ‘tweak’ our packet-format
• Our ‘xmit1000.c’ driver elected to have the
NIC append ‘padding’ to any short packets
• But this prevents a receiver from knowing
how many bytes represent actual data
• To solve this problem, we added our own
‘count’ field to each packet’s payload
Offset   Field
0        destination MAC-address
6        source MAC-address
12       Type/Len
14       count
16       actual bytes of user-data
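The packet layout above can be sketched in user-space C. The function build_frame() and its parameters are hypothetical and are not taken from ‘xmit1000.c’; the sketch assumes the count is stored as a 16-bit value in the host’s byte order at offset 14, with the user-data starting at offset 16.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define COUNT_OFFSET 14   /* follows dest MAC (6) + source MAC (6) + Type/Len (2) */
#define DATA_OFFSET  16   /* user-data begins right after the 2-byte count */

/* Fills 'frame' with a header, our 'count' field and 'count' bytes of data.
   Returns the number of frame bytes used, or 0 if 'frame' is too small. */
static size_t build_frame(uint8_t *frame, size_t cap,
                          const uint8_t dst[6], const uint8_t src[6],
                          uint16_t type_len,
                          const void *data, uint16_t count)
{
    if (cap < DATA_OFFSET + (size_t)count)
        return 0;
    memcpy(frame + 0, dst, 6);
    memcpy(frame + 6, src, 6);
    frame[12] = (uint8_t)(type_len >> 8);     /* Type/Len, high byte first */
    frame[13] = (uint8_t)(type_len & 0xFF);
    memcpy(frame + COUNT_OFFSET, &count, sizeof count);   /* host byte order */
    memcpy(frame + DATA_OFFSET, data, count);
    return DATA_OFFSET + (size_t)count;
}
```

Storing the count in host byte order is adequate only when sender and receiver share the same endianness; a more portable format would fix the byte order explicitly.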
Our ‘read()’ method
ssize_t my_read( struct file *file, char *buf, size_t len, loff_t *pos )
{
    static int      rxhead = 0;   // to remember where we left off
    unsigned char   *from = phys_to_virt( rxdesc[ rxhead ].base_addr );
    unsigned int    count;

    // go to sleep if no new data-packets have been received yet
    if ( ioread32( io + E1000_RDH ) == rxhead )
        if ( wait_event_interruptible( wq_rd,
                ioread32( io + E1000_RDH ) != rxhead ) ) return -EINTR;

    // get the number of actual data-bytes in the new (possibly padded) data-packet
    count = *(unsigned short*)(from + 14);  // data-count as stored by 'xmit1000.c'
    if ( count > len ) count = len;  // can't transfer more bytes than buffer can hold
    if ( copy_to_user( buf, from+16, count ) ) return -EFAULT;

    // advance our static array-index variable to the next receive-descriptor
    rxhead = (1 + rxhead) % 8;  // this index wraps-around after 8 descriptors
    return count;  // tell kernel how many bytes were transferred
}
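The data-handling steps of this ‘read()’ method can be tried out in user-space, without the hardware. The helpers below (extract_payload(), next_index() and packet_available()) are hypothetical names for this sketch; memcpy() stands in for copy_to_user(), and plain integers stand in for the RDH register and the ‘rxhead’ variable.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define N_RX_DESC 8   /* number of receive-descriptors in the ring */

/* A new packet is waiting whenever the controller's head-index (RDH)
   has moved past the index the driver will read next */
static int packet_available(unsigned rdh, unsigned rxhead)
{
    return rdh != rxhead;
}

/* Copies at most 'len' bytes of user-data out of a received frame and
   returns the number of bytes copied; excess bytes are dropped */
static size_t extract_payload(const uint8_t *frame, void *buf, size_t len)
{
    uint16_t stored;
    size_t   count;

    memcpy(&stored, frame + 14, sizeof stored);   /* our 'count' field */
    count = stored;
    if (count > len)
        count = len;          /* can't transfer more than the buffer holds */
    memcpy(buf, frame + 16, count);
    return count;
}

/* Advances a ring-index, wrapping around after the last descriptor */
static unsigned next_index(unsigned rxhead)
{
    return (rxhead + 1) % N_RX_DESC;
}
```

The clamping in extract_payload() is where a too-small user buffer causes the remaining bytes of a packet to be discarded.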
Hardware’s initialization
• We allocate and initialize a minimum-size
Receive Descriptor Queue (8 descriptors)
• We perform a ‘global reset’ via the RST-bit
in the NIC’s Device Control register (with a
side-effect of zeroing both RDH and RDT)
• We configure the ‘receive’ engine (RCTL)
plus a few additional registers that affect
the network-controller’s reception-options
(namely: RXCSUM, RFCTL, PSRCTL)
Receive Control (0x0100)
Bits    Field    Meaning
31      R        reserved (=0)
30-27   FLXBUF   Flexible Buffer size
26      SECRC    Strip Ethernet CRC
25      BSEX     Buffer Size Extension
24      R        reserved (=0)
23      PMCF     Pass MAC Control Frames
22      DPF      Discard Pause Frames
21      R        reserved (=0)
20      CFI      Canonical Form Indicator bit-value
19      CFIEN    Canonical Form Indicator Enable
18      VFE      VLAN Filter Enable
17-16   BSIZE    Receive Buffer Size
15      BAM      Broadcast Accept Mode
14      R        reserved (=0)
13-12   MO       Multicast Offset
11-10   DTYP     Descriptor Type
9-8     RDMTS    Rx-Descriptor Minimum Threshold Size
7-6     LBM      Loopback Mode
5       LPE      Long Packet reception Enable
4       MPE      Multicast Promiscuous Enable
3       UPE      Unicast Promiscuous Enable
2       SBP      Store Bad Packets
1       EN       Receive Enable
0       R        reserved (=0)
Our driver initially will program this register with the value 0x0400801C. Then
later, when everything is ready, it will turn on bit #1 to ‘start the receive engine’
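To see which options the value 0x0400801C selects, it can be rebuilt from named bits. The macro and function names in this sketch are hypothetical; the bit positions are the ones read from the register layout above (SECRC=26, BAM=15, MPE=4, UPE=3, SBP=2, EN=1) and should be checked against Intel’s manual.

```c
#include <stdint.h>

/* Receive Control bits used here; macro names are hypothetical */
#define RCTL_EN     (1u << 1)    /* Receive Enable */
#define RCTL_SBP    (1u << 2)    /* Store Bad Packets */
#define RCTL_UPE    (1u << 3)    /* Unicast Promiscuous Enable */
#define RCTL_MPE    (1u << 4)    /* Multicast Promiscuous Enable */
#define RCTL_BAM    (1u << 15)   /* Broadcast Accept Mode */
#define RCTL_SECRC  (1u << 26)   /* Strip Ethernet CRC */

/* The value programmed while the receive engine is still turned off */
static uint32_t rctl_initial(void)
{
    return RCTL_SECRC | RCTL_BAM | RCTL_MPE | RCTL_UPE | RCTL_SBP;
}

/* The same settings with bit #1 turned on to start the receive engine */
static uint32_t rctl_started(uint32_t rctl)
{
    return rctl | RCTL_EN;
}
```

Under this reading the driver accepts broadcast frames, runs promiscuously for both unicast and multicast, stores bad packets, and has the controller strip the Ethernet CRC.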
Packet-Split Rx Control (0x2170)
Bits    Field
31-30   0
29-24   BSIZE3 (in KB)
23-22   0
21-16   BSIZE2 (in KB)
15-14   0
13-8    BSIZE1 (in KB)
7       0
6-0     BSIZE0 (in 1/8 KB)
If the controller is configured to use the packet-split feature (RCTL.DTYP=1),
then this register controls the sizes of the four receive-buffers, so there are
certain requirements that nonzero values appear in several of these fields.
But our ‘recv1000.c’ driver will use the ‘legacy’ receive-descriptor format
(i.e., RCTL.DTYP=0) and so this register will be disregarded by the NIC
and therefore we are allowed to program it with the value 0x00000000.
Receive Filter Control (0x5008)
[Register diagram: bit-fields of the Receive Filter Control register. Fields shown in bits 15..0, from high to low: EXSTEN (bit 15), IPFRSP_DIS, ACKD_DIS, ACK_DIS, IPv6_XSUM_DIS, IPv6_DIS, NFS_VER, NFSR_DIS, NFSW_DIS, iSCSI_DWC, iSCSI_DIS]
Our driver writes 0x00000000 to this register, which among other effects will
cause the ethernet controller NOT to write Extended Status information into
our device-driver’s legacy-format Receive Descriptors (bit 15: EXSTEN=0)
RX Checksum Control (0x5000)
Bits    Field
31-10   reserved
9       TCP/UDP Checksum Off-load enabled (1=yes, 0=no)
8       IP Checksum Off-load enabled (1=yes, 0=no)
7-0     packet checksum start (this field controls the starting byte for the Packet Checksum calculation)
Our driver programs this register with the value 0x00000000, which disables
Checksum Off-loading both for TCP/UDP packets (which we won’t be receiving)
and for IP packets (which likewise won’t be sent by our ‘xmit1000.c’ driver);
all Packet-Checksums will then be calculated starting from the very first byte
Rx-Descriptor Control (0x2828)
Bits    Field
31-25   0
24      GRAN
23-22   0
21-16   WTHRESH (Writeback Threshold)
15-14   0
13-8    HTHRESH (Host Threshold)
7-6     0
5-0     PTHRESH (Prefetch Threshold)
“This register controls the fetching and write back of receive descriptors.
The three threshold values are used to determine when descriptors are
read from, and written to, host memory. Their values can be in units of
cache lines or of descriptors (each descriptor is 16 bytes), based on the
value of the GRAN bit (0=cache lines, 1=descriptors). When GRAN = 1,
all descriptors are written back (even if not requested).” --Intel manual
Recommended for 82573: 0x01010000 (GRAN=1, WTHRESH=1)
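The recommended value can be assembled from its fields, which shows where GRAN=1 and WTHRESH=1 land in the 32-bit word. The names in this sketch are hypothetical, and it assumes that each threshold is a 6-bit field (PTHRESH in bits 5-0, HTHRESH in bits 13-8, WTHRESH in bits 21-16) with GRAN at bit 24.

```c
#include <stdint.h>

/* Field positions in Rx-Descriptor Control; names are hypothetical */
#define RXDCTL_PTHRESH_SHIFT  0
#define RXDCTL_HTHRESH_SHIFT  8
#define RXDCTL_WTHRESH_SHIFT  16
#define RXDCTL_THRESH_MASK    0x3Fu        /* each threshold is 6 bits wide */
#define RXDCTL_GRAN           (1u << 24)   /* 0=cache lines, 1=descriptors */

/* Packs the three thresholds and the granularity choice into one value */
static uint32_t rxdctl_value(unsigned pthresh, unsigned hthresh,
                             unsigned wthresh, int count_in_descriptors)
{
    uint32_t value = 0;

    value |= (uint32_t)(pthresh & RXDCTL_THRESH_MASK) << RXDCTL_PTHRESH_SHIFT;
    value |= (uint32_t)(hthresh & RXDCTL_THRESH_MASK) << RXDCTL_HTHRESH_SHIFT;
    value |= (uint32_t)(wthresh & RXDCTL_THRESH_MASK) << RXDCTL_WTHRESH_SHIFT;
    if (count_in_descriptors)
        value |= RXDCTL_GRAN;
    return value;
}
```

Calling rxdctl_value(0, 0, 1, 1) yields 0x01010000, matching the recommended setting.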
Maximum-size buffers
• We use a minimal number of maximum-size receive-buffers (eight of 1536 bytes)
[Diagram: eight receive-buffers (buffer 0 through buffer 7) in kernel memory, each one addressed by an entry in the ring of eight rx-descriptors]
NIC “owns” our rx-descriptors
RDBAH/RDBAL: base-address of the descriptor ring
RDLEN = 0x80
[Diagram: the descriptor ring, with positions 0 through 8 marked; our ‘static’ variable rxhead, the RDH register and the RDT register each point to a position in it]
rxhead: our ‘static’ variable
RDH: This register gets initialized to 0, then gets changed by the controller as new packets are received
RDT: This register gets initialized to 8, then never gets changed
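The arithmetic behind RDLEN = 0x80 can be checked with a user-space sketch of the descriptor ring. Only the member name ‘base_addr’ appears in the driver code shown earlier; the other member names, the helper functions, and the choice to place the eight buffers back-to-back in memory are assumptions made for this example.

```c
#include <stdint.h>
#include <stddef.h>

#define N_RX_DESC  8      /* descriptors in our minimum-size ring */
#define RX_BUFSIZ  1536   /* bytes in each receive-buffer */

/* Legacy-format receive descriptor: 16 bytes in all */
struct rx_descriptor {
    uint64_t base_addr;       /* physical address of the packet-buffer */
    uint16_t packet_length;   /* bytes the controller stored there */
    uint16_t packet_chksum;
    uint8_t  desc_status;
    uint8_t  desc_errors;
    uint16_t special;
};

/* The byte-length of the ring, i.e. the value for RDLEN */
static uint32_t ring_bytes(void)
{
    return (uint32_t)(N_RX_DESC * sizeof(struct rx_descriptor));
}

/* Zeroes every descriptor and points it at its own buffer, assuming the
   buffers are contiguous and start at 'first_buffer' */
static void init_ring(struct rx_descriptor ring[N_RX_DESC], uint64_t first_buffer)
{
    unsigned i;

    for (i = 0; i < N_RX_DESC; i++) {
        ring[i] = (struct rx_descriptor){ 0 };
        ring[i].base_addr = first_buffer + (uint64_t)i * RX_BUFSIZ;
    }
}
```

Eight descriptors of 16 bytes each occupy 128 bytes, which is the 0x80 written to RDLEN.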
Driver ‘defects’
• If an application tries to ‘read’ from our
device-file ‘/dev/nic’, but the controller
received a packet that contains more
bytes of data than the user requested,
excess bytes get “lost” (i.e., discarded)
• If an application delays reading packets
while the controller continues receiving,
then an earlier packet gets “overwritten”
In-class exercise #1
• Discuss with your nearest class-member
your ideas for how these driver ‘defects’
might be overcome, so that packet-data
being received will be protected against
getting “lost” and/or being “overwritten”
In-class exercise #2
• Login to a pair of machines on the ‘anchor’
cluster and install our ‘xmit1000.ko’ and
our ‘recv1000.ko’ modules (one on each)
• Try transferring a textfile from one of the
machines to the other, by using ‘cat’:
anchor01$ cat textfile > /dev/nic
anchor02$ cat /dev/nic > recv1000.out
• How large a textfile can you successfully
transfer using our simple driver-modules?