A new spamtrap submodule is currently under development. Its targets are spamtraps located on mailservers which I administer. A few of these mailservers receive huge amounts of spam, and this leads to serious performance problems if you try to download the messages over POP3/IMAP and then parse them. A different approach was needed for situations like these, so I developed a small agent which runs on the mailserver host itself. This agent loops over the spam files in the maildir and parses them locally, without any network-based data transfer. When it is done, it saves the interesting data in serialized form on the filesystem (through the Python cPickle module) and assigns a version number to this data. This allows a remote agent to ask for the latest version and download only the versions it is missing. The submodule is built on Twisted Perspective Broker, which serializes the saved data directly onto the wire, and it currently defines a basic authentication mechanism as well. While developing this submodule I was thinking that it could be nice to use it for sharing data from multiple spamtraps between researchers. Suggestions are welcome!
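To make the versioning idea concrete, here is a minimal sketch of how such a local agent could look. It is not the actual submodule code: the function names, the `N.pickle` file layout and the choice of extracted headers are all assumptions made for illustration, and it uses the Python 3 `pickle` module in place of `cPickle`. The Perspective Broker transport and the authentication are left out entirely.

```python
import mailbox
import os
import pickle


def parse_maildir(path):
    """Parse every message of a local Maildir, keeping a few header fields."""
    box = mailbox.Maildir(path, create=False)
    records = []
    for key in box.keys():
        msg = box[key]
        # Which fields count as "interesting" is an assumption of this sketch.
        records.append({
            "key": key,
            "from": msg.get("From"),
            "subject": msg.get("Subject"),
        })
    return records


def latest_version(store_dir):
    """Return the highest snapshot version in store_dir, or 0 if there is none."""
    versions = [
        int(name[:-len(".pickle")])
        for name in os.listdir(store_dir)
        if name.endswith(".pickle")
    ]
    return max(versions, default=0)


def save_snapshot(store_dir, records):
    """Pickle records under the next version number and return that number."""
    version = latest_version(store_dir) + 1
    final = os.path.join(store_dir, "%d.pickle" % version)
    tmp = final + ".tmp"
    with open(tmp, "wb") as fh:
        pickle.dump(records, fh, protocol=pickle.HIGHEST_PROTOCOL)
    # Rename only after the write completed, so readers never see a partial file.
    os.replace(tmp, final)
    return version


def missing_versions(store_dir, last_seen):
    """List the versions a remote agent still needs, given the last one it has."""
    return list(range(last_seen + 1, latest_version(store_dir) + 1))


def load_snapshot(store_dir, version):
    """Load one saved snapshot by version number."""
    with open(os.path.join(store_dir, "%d.pickle" % version), "rb") as fh:
        return pickle.load(fh)
```

A remote agent that last saw version 3 would call `missing_versions(store_dir, 3)` and fetch only what comes back. One thing to keep in mind if this is ever used to share data between researchers: unpickling data from an untrusted peer can execute arbitrary code, so the serialization format deserves a second look in that scenario.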
Archive for July, 2009
A few days ago I started thinking about the scalability limits of the TIP Fast-Flux Tracking module and realized its design was really awful. The approach was based on the idea of assigning a monitoring thread to each fluxy domain. This works well as long as the number of threads stays quite small, but not at the scale I was aiming for. First of all, as the number of threads grows, performance degrades because of the Python Global Interpreter Lock, which limits the concurrency of a single interpreter process running multiple threads (and running the process on a multiprocessor system brings no improvement). Moreover, it’s really hard to guarantee each thread enough stack space to run without causing segmentation faults. For these reasons I decided to rewrite the module from scratch and I’m currently testing it. The new design is really simple, effective and scalable, and I have to thank Jose Nazario, Marcello Barnaba and Orlando Bassotto for the really interesting talks we had about this matter. Just one process and no monitoring threads. The code is written in such a way as to avoid blocking calls, which makes the module truly asynchronous. However, once a domain starts being monitored, the module needs to access the backend database, and that does require blocking calls. When this happens, the blocking calls are delegated to the Twisted thread pool together with a cloned copy of the collected data, so that code scalability is not compromised by unnecessary locks. Moreover, the module is now turning into a Twisted Application of its own, and the first tests done using the Twisted Epoll Reactor are absolutely encouraging. Stay tuned!
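The "hand a cloned copy to the thread pool" pattern can be sketched in a few lines. Note that this is not the module's code and it does not use Twisted: to keep the example self-contained it relies on the standard library's `asyncio` and `ThreadPoolExecutor`, where `loop.run_in_executor` plays the role that `twisted.internet.threads.deferToThread` plays in Twisted. The function names, the fake database (a plain list) and the sample addresses are hypothetical.

```python
import asyncio
import copy
from concurrent.futures import ThreadPoolExecutor


def store_blocking(snapshot, db):
    """Stand-in for a blocking database write; runs in a worker thread."""
    db.append(snapshot)
    return len(snapshot["ips"])


async def monitor(domain, rounds, executor, db):
    """Collect data for one domain without ever blocking the event loop."""
    loop = asyncio.get_running_loop()
    state = {"domain": domain, "ips": []}
    pending = []
    for i in range(rounds):
        # Hypothetical observation; a real monitor would resolve the domain here.
        state["ips"].append("192.0.2.%d" % (i + 1))
        # The worker gets its own deep copy, so the event loop can keep
        # mutating `state` without taking any lock.
        snapshot = copy.deepcopy(state)
        pending.append(loop.run_in_executor(executor, store_blocking, snapshot, db))
        await asyncio.sleep(0)  # yield so the other monitors can run
    return await asyncio.gather(*pending)


def run_demo(domains, rounds=3):
    """Monitor all domains in one process, with one loop and a small thread pool."""
    db = []

    async def main():
        with ThreadPoolExecutor(max_workers=4) as executor:
            tasks = [monitor(d, rounds, executor, db) for d in domains]
            return await asyncio.gather(*tasks)

    return asyncio.run(main()), db
```

The design choice worth noticing is the `deepcopy`: it costs a little memory per write, but each worker thread only ever touches data that nobody else holds a reference to, so the monitoring code needs no locks at all.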