Translating with pytranslate directly from Google

    06 Jan 2010

    I decided to translate some texts in batch, and for that I used a module called pytranslate.

    It is a wrapper for Google Translate that lets you translate text easily. Its return value is always the complete translated sentence.

    Usage:

    import translate
    
    print translate.translate('hello', sl='english', tl='portuguese')
    print translate.translate('hello', sl='auto', tl='portuguese')
    print translate.translate('hallo', sl='auto', tl='portuguese')
    print translate.translate('hallo', sl='auto', tl='french')
    print translate.translate('Bonjour', sl='auto', tl='dutch')
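
    For batch work, a small helper can loop over many lines. This is only a sketch in Python 3: translate_batch and fake_translate are hypothetical names, and the stand-in function replaces the real call so the example runs without network access. In real use you would pass translate.translate, assuming it accepts the sl and tl keywords shown above.

```python
def translate_batch(texts, translate_fn, sl="auto", tl="portuguese"):
    """Translate each non-empty text with translate_fn and return the results.

    translate_fn is expected to take (text, sl=..., tl=...), like the
    translate.translate calls shown above.
    """
    results = []
    for text in texts:
        text = text.strip()
        if not text:
            continue  # skip blank lines in the input batch
        results.append(translate_fn(text, sl=sl, tl=tl))
    return results


def fake_translate(text, sl, tl):
    """Stand-in for translate.translate, used only for this example."""
    return "[%s->%s] %s" % (sl, tl, text)


if __name__ == "__main__":
    for line in translate_batch(["hello", "", " hallo "], fake_translate, tl="french"):
        print(line)
```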

    Installation is simple:

    wget http://github.com/stockrt/pytranslate/tarball/master -O pytranslate.tar.gz
    tar xzvf pytranslate.tar.gz
    cd stockrt-pytranslate*
    ./setup.py install

    You can get the pytranslate code at http://github.com/stockrt/pytranslate


    Browsing securely and without charges in cafés and hotels

    20 Sep 2009

    When you are thinking of staying at a hotel, or using a café's network, first ask how much internet access costs per minute or per day.

    The cost is usually high. If that is the case, choose the connection billed per minute and use the following trick to save some change:

    1. Log in to the hotel's system as usual, choosing the lowest rate for a short period of use;
    2. Run one of the commands below, ssh or putty;
    3. Log off from the billing system;
    4. Configure your browser and other applications to use a SOCKS proxy on the chosen local port.

    Using Linux

    If you are running Linux on your laptop, you can do the following:

    ssh -D 9999 user@servidor

    Using Windows

    If you are on Windows, use putty:

    putty.exe -D 9999 user@servidor

    Note: You need, of course, a server (“servidor” in the commands above) running sshd, and an account on it :) Maybe the Linux box you keep running on your desktop at home? Or that server at college or work?

    Another tip is to configure putty to keep the session always open with keepalives (Category / Connection / Seconds between keepalives / 5), or to keep the session active by leaving some command running in the shell that generates terminal traffic:

    while true; do date; sleep 5; done

    Once the dynamic proxy is created

    Now you can log off from the billed internet system and then point Firefox, or any other application (a new ssh session, for example), to the local dynamic proxy, which now listens on local port 9999.

    Other posts also describe this technique.

    With these measures you can save on access costs and also browse much more securely, since all traffic is tunneled and encrypted by ssh.

    Enjoy the hacking.


    handling html forms with python mechanize and BeautifulSoup

    14 Sep 2009

    In the post about emulating a browser in python with mechanize I showed you some basic web tricks with python, but I did not show how to log in to a site and handle a session, with html forms, links and cookies.

    Here I will show you all of it. Let’s see.

    First, you must install some dependencies:

    easy_install BeautifulSoup
    easy_install html2text

    Then, let the code speak:

    import mechanize
    import cookielib
    from BeautifulSoup import BeautifulSoup
    import html2text
    
    # Browser
    br = mechanize.Browser()
    
    # Cookie Jar
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    
    # Browser options
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    
    # Follow refresh 0 but do not hang on refresh > 0
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    
    # User-Agent (this is cheating, ok?)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    
    # The site we will navigate into, handling its session
    br.open('http://gmail.com')
    
    # Select the first (index zero) form
    br.select_form(nr=0)
    
    # User credentials
    br.form['Email'] = 'user'
    br.form['Passwd'] = 'password'
    
    # Login
    br.submit()
    
    # Filter all links to mail messages in the inbox
    all_msg_links = [l for l in br.links(url_regex=r'\?v=c&th=')]
    # Select the first 3 messages
    for msg_link in all_msg_links[0:3]:
        print msg_link
        # Open each message
        br.follow_link(msg_link)
        html = br.response().read()
        soup = BeautifulSoup(html)
        # Filter html to only show the message content
        msg = str(soup.findAll('div', attrs={'class': 'msg'})[0])
        # Show raw message content
        print msg
        # Convert html to text, easier to read but can fail if you have
        # international characters
        # print html2text.html2text(msg)
        print
        # Go back to the Inbox
        br.follow_link(text='Inbox')
    
    # Logout
    br.follow_link(text='Sign out')

    The basic flow is:

    • Open the site and log in;
    • The session is handled by the cookiejar, automatically;
    • We list the first 3 mail messages;
    • For each mail message, we open it and read its contents;
    • We go back to the Inbox and on to the next mail message;
    • All done, we can log off;
    • The first 3 mail messages will have their status changed to “read” if you look at them in the gmail web interface.
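
    The link filter is the least obvious step, so here is a standalone sketch of it in Python 3. It assumes that mechanize applies url_regex as a regular expression search against each link's URL. The sample URLs and the message_links helper are made up for illustration.

```python
import re

# Same pattern as in the script: a literal "?v=c&th=" anywhere in the URL.
URL_REGEX = r"\?v=c&th="

# Hypothetical link URLs, standing in for what br.links() would yield.
SAMPLE_URLS = [
    "?v=c&th=123abc",
    "?v=b&pv=tl",
    "https://mail.example.com/h/?v=c&th=456def",
    "/logout",
]


def message_links(urls, pattern=URL_REGEX):
    """Keep only the URLs in which the pattern is found."""
    return [url for url in urls if re.search(pattern, url)]


if __name__ == "__main__":
    for url in message_links(SAMPLE_URLS)[0:3]:
        print(url)
```

    The backslash before the question mark matters: without it, "?" would be a regex quantifier instead of the literal character that starts the query string.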

    You may ask, how do I know the names of the form fields to fill in?

    We can inspect them before filling them in:

    # Open the site
    br.open('http://gmail.com')
    
    # Forms
    for f in br.forms():
        print f

    The output contains the fields to fill in the form:

    ...
    <TextControl(Email=)>
    <PasswordControl(Passwd=)>
    ...

    And the rest you already know.
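
    If mechanize is not at hand, the same inspection can be approximated with the Python 3 standard library. This is a minimal sketch, not what mechanize does internally: FormFieldLister, list_fields and the sample HTML are made up for illustration, and only input tags are considered.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect (type, name) pairs for every <input> tag that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                # An <input> without a type attribute defaults to "text".
                self.fields.append((attrs.get("type", "text"), attrs["name"]))


# Hypothetical login form, used only for this example.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="text" name="Email">
  <input type="password" name="Passwd">
  <input type="submit" value="Sign in">
</form>
"""


def list_fields(html):
    """Return the named input fields found in the given html string."""
    parser = FormFieldLister()
    parser.feed(html)
    return parser.fields


if __name__ == "__main__":
    for field_type, name in list_fields(SAMPLE_HTML):
        print(field_type, name)
```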

    Why gmail? It was only an example. I know we have libgmail, but again, it was only an example, with login forms and a session to handle :)


    git branch in your shell prompt

    31 Aug 2009

    Every time (every few minutes?) I looked at my git versioned projects (all my projects?) I found myself in doubt:

    Am I on the branch I think I am on?

    And there I was, issuing a “git branch” command to check it…

    That is enough! I said.

    Looking at the github guides I found this tip very interesting, so I decided to improve it and publish it as a tip here.

    It shows your current branch in your prompt whenever your current working directory is inside a git repository.

    To use it, just place these lines inside your ~/.bashrc or into /etc/profile.d/git-branch.sh or /etc/bashrc or even /etc/profile, the choice is yours:

    # git branch
    parse_git_branch() {
        git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \(.*\)/(\1) /'
    }
    PS1="\$(parse_git_branch)$PS1"

    This will give you a prompt like this one:

    stockrt@jackbauer ~ $ cd Dropbox/stockrt/git/stockrt.github.com
    (master) stockrt@jackbauer ~/Dropbox/stockrt/git/stockrt.github.com $

    Here you can see a “normal” prompt and then, when I enter one of my git versioned directories, a “git branched” prompt.
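
    To make it clearer what the sed pipeline does, here is a rough Python 3 equivalent. It is only a sketch: current_branch_prefix is a hypothetical name, and it works on a sample of “git branch” output instead of calling git.

```python
def current_branch_prefix(git_branch_output):
    """Return "(branch) " for the line marked with "* ", or "" if there is none.

    This mirrors the sed script: lines that do not start with "*" are
    dropped, and the marked line is rewritten as "(name) ".
    """
    for line in git_branch_output.splitlines():
        if line.startswith("* "):
            return "(%s) " % line[2:]
    return ""


# Sample "git branch" output: the current branch is marked with "*".
SAMPLE_OUTPUT = "  develop\n* master\n  topic\n"

if __name__ == "__main__":
    print(current_branch_prefix(SAMPLE_OUTPUT) + "stockrt@jackbauer ~ $")
```

    Outside a git repository “git branch” prints nothing on stdout, so the prefix is empty and the prompt stays unchanged.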

    Way cool.


    Browser cache invalidation with Javascript and querystring

    30 Aug 2009

    Some time ago I started my blog here at github and noticed that new posts did not go live right after I published them.

    I quickly spotted the problem: github sends HTTP cache headers for index.html and every other page it serves, with a 24 hour cache lifetime.

    The problem

    $ curl -I http://stockrt.github.com
    
    HTTP/1.1 200 OK
    Server: nginx/0.6.31
    Date: Sat, 22 Aug 2009 01:36:49 GMT
    Content-Type: text/html
    Content-Length: 66829
    Last-Modified: Sat, 22 Aug 2009 01:12:50 GMT
    Connection: keep-alive
    Expires: Sun, 23 Aug 2009 01:36:49 GMT
    Cache-Control: max-age=86400
    Accept-Ranges: bytes

    So, to overcome this “problem” I made this tiny trick and published it so others can take advantage of it, in case you are hosting your pages behind a web server with Expires configured.

    The trick

    Go and clone cache_invalidation and start using the provided JavaScript files on your site, like this:

    <html>
    
    <head>
     <script src="http://your_site/javascripts/querystring.js" type="text/javascript"></script>
     <script src="http://your_site/javascripts/cache_invalidation.js" type="text/javascript"></script>
    </head>
    
    <body>
    </body>
    
    </html>

    Set the desired TTL inside the cache_invalidation.js file:

    // TTL: set your cache threshold here
    var ttl = 300;  // seconds

    And it is all set.

    But why does it happen, and how does it work?

    It happens because their web server (the great nginx) is configured with the equivalent of what Apache calls mod_expires. That feature sets the Expires HTTP cache header.

    If you look at the response headers I got before, you will see:

    Date: Sat, 22 Aug 2009 01:36:49 GMT
    Expires: Sun, 23 Aug 2009 01:36:49 GMT

    and:

    Cache-Control: max-age=86400

    Notice that:

    $ bc <<< 86400/3600
    24

    They are telling my browser that it should use the local copy for the next 24 hours when accessing this site. More precisely, when accessing index.html of this site.
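
    The same arithmetic can be checked against the Date and Expires headers themselves. Here is a small Python 3 sketch using only the standard library, with the header values copied from the response shown earlier. The variable names are mine.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the curl output above.
date = parsedate_to_datetime("Sat, 22 Aug 2009 01:36:49 GMT")
expires = parsedate_to_datetime("Sun, 23 Aug 2009 01:36:49 GMT")
max_age = 86400  # from Cache-Control: max-age=86400

# Lifetime implied by Expires, measured from the response Date.
lifetime = (expires - date).total_seconds()

if __name__ == "__main__":
    print("Expires - Date = %d seconds (%d hours)" % (lifetime, lifetime // 3600))
    print("Matches max-age:", lifetime == max_age)
```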

    I think that, for a blog, this is a pretty long time to wait before the user’s cache is refreshed. This cache header means that if a reader visited your site just before you posted something and came back after you posted, they would see no difference. They would only notice your new post the next day.

    But you can bypass that simply by adding any query string to the site’s address in your browser’s address bar.

    This tricks the browser into going to the origin and fetching the page instead of using a local copy. It will only use a local copy if there is no query string, or if it has already cached that url with that exact query string (say, on a second visit to the same query string).

    Because the browser would cache the same query string on a second access, I made the script vary it on each access. It also forces a refresh when you open a query string that is more than TTL seconds older than the current time, even if it is already cached from a previous access, say, when clicking a bookmark.
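
    The decision logic can be sketched in a few lines. This is a Python 3 illustration, not the actual cache_invalidation.js code. It assumes the query string carries a Unix timestamp in a parameter called t, which is a hypothetical name, and stamped_url and is_stale are made-up helpers.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

TTL = 300  # seconds, the same value as the ttl setting shown earlier


def stamped_url(url, now):
    """Return url with its "t" query parameter set to the current time."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["t"] = [str(int(now))]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def is_stale(url, now, ttl=TTL):
    """True if url has no usable timestamp or its timestamp is older than ttl."""
    stamps = parse_qs(urlsplit(url).query).get("t")
    if not stamps:
        return True  # no query string: the browser may serve a cached copy
    try:
        stamp = int(stamps[0])
    except ValueError:
        return True
    return now - stamp > ttl


if __name__ == "__main__":
    fresh = stamped_url("http://example.com/index.html", 1000)
    print(fresh)
    print(is_stale(fresh, 1200), is_stale(fresh, 1301))
```

    A page opened from an old bookmark would carry an old timestamp, so it would be treated as stale and reloaded with a new one.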

    As the front end engineer that I am, I only pray that my web developer colleagues never find this post :)
