Thu 13 Apr 2006
Using epoll() For Asynchronous Network Programming
Posted by Scoundrel under Development, Networks
The usual way to implement TCP servers is “one thread/process per connection”. Under high load, however, this approach can be inefficient, and we need other patterns of connection handling. In this article I will describe how to implement a TCP server with asynchronous connection handling using the epoll interface of the Linux 2.6 kernel.
epoll is an I/O event notification facility introduced in Linux 2.6. It is designed as a scalable alternative to the older select and poll calls. Those calls are O(n) in the number of watched file descriptors, whereas the cost of waiting with epoll depends on how many descriptors are ready rather than on how many are watched. This means that it scales well as the number of watched file descriptors increases. select uses a linear search through the list of watched file descriptors, which causes its O(n) behaviour, whereas epoll uses callbacks in the kernel file structure.
Another fundamental difference of epoll is that it can be used in an edge-triggered, as opposed to level-triggered, fashion (level-triggered is the default; edge-triggered mode is requested with the EPOLLET flag). This means that you receive “hints” when the kernel believes the file descriptor has become ready for I/O, as opposed to being told “I/O can be carried out on this file descriptor”. This has a couple of minor advantages: kernel space doesn’t need to keep track of the state of the file descriptor, although it might just push that problem into user space, and user space programs can be more flexible (e.g. the readiness change notification can just be ignored).
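To make the edge-triggered case concrete, here is a minimal sketch in C. It assumes the descriptor is non-blocking and that the caller reads until the kernel reports EAGAIN, since an edge-triggered notification is not repeated for data that is already waiting. The helper names set_nonblocking, watch_edge_triggered and drain_fd are made up for this illustration and are not part of the epoll API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Register fd for edge-triggered read notifications (EPOLLET). */
static int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Read until the kernel says "nothing more right now" (EAGAIN) or EOF.
   Returns the total number of bytes read, or -1 on a real error. */
static long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                 /* end of file / peer closed */
        if (errno == EINTR)
            continue;                     /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                 /* fully drained */
        return -1;
    }
}
```

After one notification, calling drain_fd() empties the descriptor. If nothing new arrives, a further epoll_wait() on it reports no event, which is the behaviour that distinguishes edge-triggered from level-triggered mode.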
To use epoll, take the following steps in your application:
- Create a dedicated file descriptor for epoll calls:
epfd = epoll_create(EPOLL_QUEUE_LEN);
where EPOLL_QUEUE_LEN is the maximum number of connection descriptors you expect to manage at one time (since Linux 2.6.8 the kernel treats this value only as a hint, but it must be greater than zero). The return value is a file descriptor that will be used in later epoll calls. Close this descriptor with close() when you no longer need it.
- After the first step you can add your descriptors to epoll with the following call:
static struct epoll_event ev;
int client_sock;
...
ev.events = EPOLLIN | EPOLLPRI | EPOLLERR | EPOLLHUP;
ev.data.fd = client_sock;
int res = epoll_ctl(epfd, EPOLL_CTL_ADD, client_sock, &ev);
where ev is the epoll event configuration structure and EPOLL_CTL_ADD is the predefined command constant for adding sockets to epoll. A detailed description of the epoll_ctl flags can be found in the epoll_ctl(2) man page. When the client_sock descriptor is closed, it is automatically removed from the epoll descriptor (provided no duplicate of it, e.g. from dup() or fork(), remains open).
- Once all your descriptors have been added to epoll, your process can idle and wait for something to do with the epoll’ed sockets:
struct epoll_event events[MAX_EPOLL_EVENTS_PER_RUN];

while (1) {
    // wait for something to do...
    int nfds = epoll_wait(epfd, events, MAX_EPOLL_EVENTS_PER_RUN, EPOLL_RUN_TIMEOUT);
    if (nfds < 0) die("Error in epoll_wait!");

    // for each ready socket
    for (int i = 0; i < nfds; i++) {
        int fd = events[i].data.fd;
        handle_io_on_socket(fd);
    }
}
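The three steps above can be put together roughly as follows. This is only a sketch under stated assumptions. The constant values are illustrative, since the article does not give them. The body of handle_io_on_socket is a hypothetical stand-in that reads one chunk. The helper names make_epoll, watch_fd and run_once are introduced just for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Illustrative values; the article does not specify them. */
#define EPOLL_QUEUE_LEN 64            /* only a size hint on modern kernels */
#define MAX_EPOLL_EVENTS_PER_RUN 16
#define EPOLL_RUN_TIMEOUT 1000        /* milliseconds; -1 would wait forever */

/* Hypothetical handler: read one chunk, return bytes read (0 = peer closed). */
static ssize_t handle_io_on_socket(int fd)
{
    char buf[512];
    ssize_t n;
    do {
        n = read(fd, buf, sizeof buf);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* Step 1: create the epoll instance. */
static int make_epoll(void)
{
    return epoll_create(EPOLL_QUEUE_LEN);
}

/* Step 2: register fd for readability and error/hang-up events. */
static int watch_fd(int epfd, int fd)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN | EPOLLPRI | EPOLLERR | EPOLLHUP;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Step 3, one iteration: wait, then service every ready descriptor.
   Returns the number of ready descriptors (0 on timeout), or -1 on error. */
static int run_once(int epfd)
{
    struct epoll_event events[MAX_EPOLL_EVENTS_PER_RUN];
    int nfds;

    do {
        nfds = epoll_wait(epfd, events, MAX_EPOLL_EVENTS_PER_RUN, EPOLL_RUN_TIMEOUT);
    } while (nfds < 0 && errno == EINTR);   /* a signal is not a failure */

    for (int i = 0; i < nfds; i++) {
        int fd = events[i].data.fd;
        if (handle_io_on_socket(fd) <= 0)
            close(fd);   /* closing the last reference also removes fd from epoll */
    }
    return nfds;
}
```

A server would call make_epoll() once, watch_fd() for every accepted socket, and run_once() inside its main loop. Retrying on EINTR is a design choice of this sketch, so that a delivered signal does not terminate the loop.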
A typical architecture for the networking part of your application is described below. This architecture scales well on both single- and multi-processor systems:
- Listener - a thread that performs the bind() and listen() calls and waits for incoming connections. When a new connection arrives, this thread calls accept() on the listening socket and sends the accepted connection socket to one of the I/O workers.
- I/O Worker(s) - one or more threads that receive connections from the listener and add them to epoll. The main loop of a generic I/O worker looks like the last step of the epoll usage pattern described above.
- Data Processing Worker(s) - one or more threads that receive data from and send data to the I/O workers and perform the data processing.
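The hand-off between the listener and the I/O workers might look roughly like the sketch below. It is not a complete server. Thread creation is left out, so each function is meant to be called from the thread that plays that role. The data processing stage is replaced by a plain echo. The struct io_worker type and all function names are hypothetical. The sketch relies on the fact that one thread may add a descriptor to an epoll instance while another thread is waiting on it.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 16

/* One I/O worker owns one epoll instance. */
struct io_worker {
    int epfd;
};

static int io_worker_init(struct io_worker *w)
{
    w->epfd = epoll_create(MAX_EVENTS);
    return w->epfd < 0 ? -1 : 0;
}

/* Called by the listener: hand an accepted socket to a worker. */
static int io_worker_adopt(struct io_worker *w, int client_sock)
{
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = client_sock;
    return epoll_ctl(w->epfd, EPOLL_CTL_ADD, client_sock, &ev);
}

/* Listener role: accept one connection and give it to the next worker
   (round-robin; a single listener thread is assumed).
   Returns the index of the chosen worker, or -1 on error. */
static int listener_accept_one(int listen_sock, struct io_worker *workers, int nworkers)
{
    static int next = 0;
    int client;

    do {
        client = accept(listen_sock, NULL, NULL);
    } while (client < 0 && errno == EINTR);
    if (client < 0)
        return -1;

    int chosen = next;
    next = (next + 1) % nworkers;
    if (io_worker_adopt(&workers[chosen], client) < 0) {
        close(client);
        return -1;
    }
    return chosen;
}

/* I/O worker role, one loop iteration. Echoing stands in for passing the
   data to a data-processing worker. Returns events handled, or -1. */
static int io_worker_step(struct io_worker *w, int timeout_ms)
{
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(w->epfd, events, MAX_EVENTS, timeout_ms);

    for (int i = 0; i < n; i++) {
        int fd = events[i].data.fd;
        char buf[512];
        ssize_t len = read(fd, buf, sizeof buf);
        if (len <= 0 || write(fd, buf, (size_t)len) < 0)
            close(fd);   /* peer closed or error; short writes ignored here */
    }
    return n;
}
```

In a real application the listener would loop over listener_accept_one() and each worker thread would loop over io_worker_step(). Round-robin is only one possible way to pick a worker.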
As you can see, the epoll API is very simple but, believe me, it is very powerful. Because it scales well with the number of connections, you can manage huge numbers of parallel connections with a small number of worker threads compared to the classical one-thread-per-connection model.
If you want to read more about epoll or look at some benchmarks, you can visit the epoll Scalability Web Page at Sourceforge. Other interesting resources are:
- The C10K problem: the best-known page about handling many connections and various I/O paradigms, including epoll.
- libevent: a high-level event-handling library on top of epoll. This page contains some information about performance tests of epoll.
2006-04-14 at 6.15 pm
It should be pointed out that if you use this approach, all code from handle_io_on_socket must avoid blocking no matter what. This can be nearly impossible in an application that’s not multi-threaded.
2006-10-27 at 4.46 pm
This is exactly what I was looking for, thanks for the great information.
2007-06-23 at 8.26 pm
Not if you use the aio functions.
2007-06-28 at 1.15 pm
Is it possible to use FD (file descriptor) meant for poll() with epoll()?
2007-08-26 at 1.42 pm
Eranga, yes.
OT: Great tutorial, even though you never declared events.
Well done Scoundrel.
2007-09-04 at 8.29 am
I found a website with an epoll example written by zhoulifa(zhoulifa@163.com). Its comments are in Chinese; could anyone translate or document it in English so it can be understood better? http://zhoulifa.bokee.com/6081520.html
Btw, why must the function handle_io_on_socket avoid blocking? Is accessing a DB like MySQL non-blocking?
2007-10-23 at 7.08 pm
Sonny,
Just use Google’s translator. Click this link to see a translation (should help a little):
http://translate.google.com/translate?hl=en&sl=zh-CN&u=http://zhoulifa.bokee.com/6081520.html&sa=X&oi=translate&resnum=9&ct=result&prev=/search%3Fq%3Depoll%2Bserver%26complete%3D1%26hl%3Den%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26sa%3DG
2008-07-07 at 6.32 am
Hi,
Thanks for the info for epoll. These are helpful.
I had one question regarding the user data variable given as part of epoll_event structure.
If only “fd” is used for epolling, why are u32/u64 and void pointers provided?
thanks,
Prashanth
2008-07-20 at 4.31 pm
Actually, “fd” is not “used” for epolling. The fd is passed separately to epoll_ctl; the data field is there for passing in any data the user requires. As it is a union, writing to “void *ptr” will overwrite fd.
This field is handy for attaching your own per-connection data. For example, you can keep the data gathered from the connection so far in a structure that the void *ptr points to. When more data is ready, you then have access to the previously stored data and can add the new data to it.
James
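The use of data.ptr described in this comment could look like the following sketch. The struct conn type, its fields and the function names are hypothetical. The only epoll facts relied on are that data is a union and that whatever was stored at epoll_ctl time comes back unchanged from epoll_wait.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection state. */
struct conn {
    int fd;              /* kept here because data.ptr replaces data.fd */
    size_t bytes_seen;   /* accumulated across several events */
};

/* Allocate state for fd and register it; returns the state or NULL. */
static struct conn *conn_watch(int epfd, int fd)
{
    struct conn *c = calloc(1, sizeof *c);
    if (c == NULL)
        return NULL;
    c->fd = fd;

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.ptr = c;     /* union: do not also set ev.data.fd */
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        free(c);
        return NULL;
    }
    return c;
}

/* Handle one ready event: recover the state from data.ptr and update it. */
static struct conn *conn_on_event(const struct epoll_event *e)
{
    struct conn *c = e->data.ptr;
    char buf[256];
    ssize_t n = read(c->fd, buf, sizeof buf);
    if (n > 0)
        c->bytes_seen += (size_t)n;
    return c;
}
```

The caller remains responsible for freeing the struct conn when the connection is closed; that part is omitted here.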
2008-07-31 at 1.35 pm
Interesting, I’ll have to give py-epoll a spin :)
Especially since I’ve just moved Django to an asynchronous server that uses epoll B)
By the way, if I’m not mistaken, the lighttpd server came out of the C10K problem.
2008-08-27 at 1.22 am
Great write up. I couldn’t find a more detailed example than this. However, how do you know when a client closes a connection? It seems that you should be able to check your event against EPOLLHUP but the event number shown when a client closes is 0x5 while EPOLLHUP is defined as 0x10. So is there some bit masking I have to do?
Thanks,
Addisu