Best In Class - The Blog

My thoughts on FP, Business, Sales and more. Thanks for reading along.

Earlier today I found myself wondering about the Page Ranks on certain keywords for my websites and decided to write a utility to scrape Google. It seemed like a good idea :)

Strategy

Scraping Google should be very simple

  • Construct a search string
  • Download the page served at that particular uri
  • Extract links and record their rank
  • PROFIT

Tools

For this we need only two tools:

  • clj-http 0.4.0- for the actual download
  • enlive 1.0.0 - for the scraping

Getting set up

(ns gsearch.core
  (:use net.cgrand.enlive-html
        [clj-http.client :rename {get download}]))

(def user-agent "Scraptastic")
(def base (str "http://www.google.dk/search?hl=en&q="))

clj-http offers the get function, which allows me to download pages from the web, but to avoid the clash with clojure.core/get I’ve renamed it to download. Secondly I’ve defined my user-agent, which you can set to whatever you want and finally I’ve defined the base string which is the first component in all searches.

Next I did a Google Search and inspected a couple of the hits on the page. It appears that the simples route to the links is

<li class="g">
    <h3 class="r"> 
         <a href=".........."

Where the a-tag contains the link I need to cross reference with the domain I’m currently investigating.

Tricky northern characters

While doing the manual testing, I noticed that Google manhandles the scandinavian letters so in addition to replacing spaces with +’s we need to encode special characters as well. He’s a quick and dirty conversion which scratches my itch:

(defn encode
  "Encodes S to Google Format"
  [s]
  (let [chars {\ø "%C3%B8" \Ø "%C3%98" \æ "%C3%A6" \Æ "%C3%86"
               \å "%C3%A5" \Å "%C3%85" \space "+"}]
    (reduce #(str %1 (get chars %2 %2)) "" s)))

Downloading each page

Now that we have the tools to construct a search string all we need to do is fetch the actual page that we’ll be scraping:

(defn fetch-url
  "Downloads a document as an html-resource"
  [uri]
  (-> (download uri {:headers {"User-Agent" user-agent}})
      :body html-snippet))

Quite simply, this downloads the page, flashing a certain user-string and then converts it to an enlive/html-snippet.

Extracting the links

I’ve decided that all I want is the first occurance of the domain we’re currently testing and here’s how to get it:

(defn scan-single-page
  "Scans a single page, returning the first position of the domain if any"
  [domain page]
  (->> (select page [:li.g :h3.r :a])
       (map #(-> % :attrs :href))
       (keep-indexed #(when (.contains %2 domain) (inc %1)))
       first))

First we use Enlives select fn to filter out all of the a-tags that are nested beneath li/h3. From these we extract the href attribute. To determine which position on the page a link is found at I use keep-index which both scans to see if we’ve found domain and if so returns the index that it was found at.

If it wasn’t for Google pagination, this would actually give us what we want, but there is a final hoop that we need to jump through:

Digging deeper

It makes sense to look beyond the frontpage, because if my site appear on pages 2, 3, 4 that might be a good place to do some SEO. To flip through the pages all we need to do is attach “&start=??” where ?? is an integer between 10 and 100. As I found out, Google quickly detects mass scrapes so I recommend capping your searches at 100 hits.

Because I want this process to short-circuit when it gets a hit, I’ve utilized loop:

(defn scan-page
  "Scans a page and subsequent sub-pages if no hits are obtained"
  [domain uri]
  (loop [pagerank (scan-single-page domain (fetch-url uri)) offset 0]
    (if (number? pagerank)
      (+ pagerank offset)
      (if (>= offset 100)
        -1
        (recur (scan-single-page domain
                  (fetch-url (str uri "&start=" (+ 10 offset))))
               (+ 10 offset))))))

As you can hopefully deduce from the code, this scrapes a page, checks for a hit, if none is found and we haven’t looked at 100 posts yet, we bump offset by 10 and repeat the process. The result will be either -1 (not found) or a number between 1 and 100 which is our pagerank.

Tying it together

All thats left is automating the process and presenting the results. Here is a simple way to go about it:

(defn assess
  "Determines google ranks for certain keywords for the domain"
  [domain keywords]
  (let [urls   (map #(str base (encode %)) keywords)
        report (->> (map #(vector %1 (scan-page domain %2)) keywords urls)
                    (sort-by last <))]
    (println domain)
    (println (apply str (repeat (count domain) \-)))
    (doseq [[kw pr] report]
      (println (format "%s:\t %s" (or pr "--") kw)))))

And run it like so:

gsearch.core>  (assess "www.bestinclass.dk" 
                      ["software udvikling"
                      "clojure konsulent"
                      "clojure vs scala"])
www.bestinclass.dk
------------------
-1:  software udvikling
1:   clojure konsulent
4:   clojure vs scala

Conclusion

With Clojure/Enlive its both quick and fun to set up scrapers of all kinds. After a few minutes I had learned all my Page Ranks for the most interesting keywords but as I got more playful Google got more upset and banned me after 5 minutes of fun time - So use at your own discretion!

As most of us know, evolution is a slow process and this became more true than usual for me the other night.

I had a very simple problem:

Given a path S, we know that the path starts with “dropzone/” and then an integer (sid) followed by another “/” and then finally the relative path which we want to extract.

Because I’ve tried actively using J to think differently about dataprocessing I ended up jotting down some of the possible solutions for this in Clojure. Instead of simply thinking for a minute or so and then writing a solution, I included a few different approaches which you will see in the notes below.

One by One

One way to think about this problem is as though S is simply folder/file names seperated by slashes (/), so by generating a sequence of those folder/filenames we can drop items until we reach the integer we’re looking for, ie. sid:

(defn strip-first
  [sid s]
  (->> (.split s "/")
       reverse
       (take-while #(not= (str sid) %))
       reverse
       (clojure.string/join "/")))

It reads like plain english, except perhaps for the last line which simply reassembles the string. The reason for the double reverse trick, is that drop-while would stop 1 position prematurely, so either way something has to be done manually to get the right path.

But the above solution is silly of course and its not really how we think about data. Naturally I would think about the string as a sequence which I can view through an ever expanding window and when that window is large enough to include the /sid/ we remove the matching text from S.

(defn strip-first
  [sid s]
  (let [substring (->> (reduce #(conj %1 (apply str (take %2 s))) []
                               (range 1 (-> s count inc)))
                       (filter #(.contains % (str "/" sid "/")))
                       first)]
    (.replace s substring "")))

The reducer builds this ever increasing window:

user> (reduce #(conj %1 (apply str (take %2 s))) []
              (range 1 (-> s count inc)))

["d" "dr" "dro" "drop" "dropz" "dropzo" "dropzon" "dropzone" ... "dropzone/122/images/1.jpg"]

And in this sequence we simply grab the first occurance of /sid/ and strip it from S. If you’re like me you frown upon calls to first/last like the one above and the best way to prettify such an op is simply to use destructuring, ie. (let [[path & _] …

But with this approach, we would actually be better served by using the more idiomatic reductions, which automates this sliding window approach:

(defn strip-first
  [sid s]
  (when-first [prepath (->> (reductions str "" s)
                            (filter #(.contains % (str "/" sid "/"))))]
    (.replace s prepath "")))

The big problem with this approach is that we’re neglecting to use all the data available to us. Because we know how many characters are in “dropzone/” and also the length of the string representation of sid, why not take the easy way out and simply jump that numbers of characters into S ?

 (defn strip-first
  [sid s]
  (.substring s (-> sid str count inc
                    (+ (count "dropzone/")))))

But if we’re going to go down this road, the simplest version would simply be

(defn strip-first
  [sid s]
  (.replace s (format "dropzone/%d/" sid) ""))

This takes all the “functional fun” out of this little task, but in development it really is a matter of using a hammer for driving in nails instead of a screwdriver. Which also means, that since we’re picking text apart, Regular Expressions are really the way to go and they even allow us to reduce the signature down to only S, which makes this the optimal solution for me:

(defn strip-first
  [s]
  (last (re-find #"dropzone\/\d+\/(.+)" s)))

Before wrapping up let me answer the million dollar question: How would you do this in J? I felt that the most natural course in J was simply to count the length of “dropzone/” and add the length of the string representation of the integer and then drop those letters from S:

sid=: 122
str=: 'dropzone/122/images/1.jpg'
str }.~(#'dropzone/')+(>:<.10^.sid)+1
> images/1.jpg

While this works in the REPL its not an actual function definition but fortunately the venerable Tracy Harms showed me a quick way to convert it:

stripfirst=: 4 : 0
 sid=.x
 str=.y
 str }.~(#'dropzone/')+(>:<.10^.sid)+1
)

122 stripfirst 'dropzone/122/images/1.jpg'
> images/1.jpg

In my typical development process I usually run through various scenarios in my mind before I start writing anything down, but because I was so tired, almost exhausted, when writing this I felt more comfortable writing every step down, which I in turn was a fun read afterwards - I hope you agree and that I haven’t wasted your time :)

And by the by, if you would like to see how to convert the J definition into tacit form, let me know:

stripfirst=: ] }.~ (#'dropzone/')+ 2+ <. @ (10^. [)

Its almost like black magic. Tracy was adamant that I should also present a simpler J solution:

stripfirst=: verb :'; 4 }. ;: y'

Or in tacit form:

stripfirst=: [: ; 4 }. ;:

This is actually quite easy to read: ‘Drop 4 sections (4 }.) from the tokens found by the default J parser (;:) from the argument (y). The downside is, that if the word ‘dropzone’ is prefixed with a slash then it breaks, but Tracy always writes his routines to assume perfect input data and if thats not guaranteed then he puts a filter function in front of this actual fn, instead of adding validation logic. Its actually not a bad way to code :)

For reasons that I must have mentioned either in tweets or in some old blogposts, I love Clojure. But I’m also very fond of both Haskell and J so Im going to go through some solutions for trivial problems in all three languages simultaneously.

Problem 1

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

Clojure

In Clojure this problem can be expressed very much like you how you would mentally approach the task:

(->> (range 1000)
     (filter #(or (zero? (mod % 3)) 
                  (zero? (mod % 5))))
     (reduce +))

Haskell

Haskell has great syntax for building lists with filters:

Prelude> [x | x <- [1..5]] -- No filter
[1,2,3,4,5]
    
Prelude> [x | x <- [1..5], x >= 2] -- Single filter
[2,3,4,5]

So bearing that in mind, the Haskell solution also reads like plain english:

sum [x | x <- [1..999], x `mod` 3 == 0 || x `mod` 5 == 0]

X is populated with a list of numbers in the range 1…999, these are filtered using two predicates.

J

Since our final contestant isn’t a functional language but an array language, its solution looks a bit different. For me, J is mostly for fun so a lot of the time I’ll write something that works and then proceed to rewrite it 50 times. Here’s one such example:

+/(((0=3|/1+i.1000-1) +. (0=5|/1+i.1000-1)) * (1+i.1000-1))

At first glance its pretty obvious that the first two chunks build the list of numbers that are divisible by either 3 or 5 and these are then fed to the final list of numbers. The reason for the multiplication (instead of sum) becomes obvious once you understand how the lists are built:

0=3|/1+i.10

This is read as:

0=

Zero is matched against the elements on the righthand side, returning true/false.

3|/ 

The | means residue/modulo and / is insertion, meaning | is inserted between the elements of the following list, so in Clojure terms (reduce #(= mod % #) …).

1+i.10

This is simply a list of 10 integers starting a 1.

Since we’re testing with 0= this will return either true or false:

   0=3|/1+i.10
0 0 1 0 0 1 0 0 1 0

Since a hit does not give us the value but only a 1, we need to multiply these ones by their index in i.1000. The solution could more succintly be written as:

+/(*(3|])+.&(0=[)5|])1+i.1000-1

The new operators used are [ and ] which basically mean “rightside argument” and “leftside argument” somewhat like % in Clojure. In addition you need to know that +. means OR and that & is simply used to compose two operations. The result of this combination is multiplied (the first *) with the list on the righthand side. What that does is that it enables us to call sum +/ on the resulting list. It doesn’t exactly read like plain english and in a trivial example like this, simply redeffing some names in Clojure makes them almost equal in length:

clojure: (r +(f #(or(z?(m % 3))(z?(m % 5)))(rn 1000)))
j:         +/ (*(3|]) +. &(0=[)5|]) 1+ i.1000 - 1

Problem 2

I completely lost my motivation for coding Haskell/J solutions for this problem, after reading this comment in the forum:

I estimate that I had written about 3 million lines
 of assembler code in my whole life. Now, code only
 when strictly necessary.

Phi (golden ratio) is the approximate ratio between
 two consecutive terms in a Fibonacci sequence.
The ratio between consecutive even terms approaches
 phi^3 (4.236068) because each 3rd term is even.
Use a calculator and round the results to the nearest
 integer when calculating the next terms:

 2,8,34,.. multiplying by 4.236068 each time: 144,610,
 2584,10946,46368,196418 & 832040

The sum is 1089154

My codeless regards,
Rudy.

Rosetta Code - Caesar Cipher

So, instead of the math heavy Euler problems lets try our hand with the Caesar Cipher

The key is an integer from 1 to 25. This cipher rotates the letters of the alphabet (A to Z). The encryption replaces each letter with the 1st to 25th next letter in the alphabet (wrapping Z to A). So key 2 encrypts “HI” to “JK”, but key 20 encrypts “HI” to “BC”.

Clojure

There are many ways to approach this simple problem, but think about the alphabet as an infinite stream of repeating characters. If thats the case then all we need to do is jump 26 characters into that stream and then we can freely move back and forth in the stream to de/encrypt. A stream like that is easily constructed in Clojure:

(def characters (map char (cycle (range 65 91))))

Using the lookup fn nth we can write a succint translate function:

 (defn translate
  ([offset s] (translate offset s 1))
  ([offset s modifier]
     (let [chars (map char (cycle (range 65 91)))
           offset (+ offset 26)
           s      (upper-case s)]
       (->> (remove #(= % \space) s)
            (map #(nth chars (+ (* offset modifier) (int %))))
            (apply str)))))


user> (translate 2 "The five boxing wizards jump quickly")
"IWTUXKTQDMXCVLXOPGSHYJBEFJXRZAN"
user> (translate 2 *1 -1)
"THEFIVEBOXINGWIZARDSJUMPQUICKLY"
user> (translate 22 "The five boxing wizards jump quickly")
"CQNORENKXGRWPFRIJAMBSDVYZDRLTUH"
user> (translate 22 *1 -1)
"THEFIVEBOXINGWIZARDSJUMPQUICKLY"

Haskell

The Haskell solution on Rosetta approaches the problem from a slightly different angle:

import Data.Char (ord, chr)
import Data.Ix (inRange)
 
caesar :: Int -> String -> String
caesar k = map f
  where
    f c
      | inRange ('a','z') c = tr 'a' k c
      | inRange ('A','Z') c = tr 'A' k c
      | otherwise = c
 
unCaesar :: Int -> String -> String
unCaesar k = caesar (-k)
 
-- char addition                                                                     
tr :: Char -> Int -> Char -> Char
tr base offset char = chr $ ord base + (ord char - ord base + offset) `mod` 26

It does pretty much what it says on the tin. Here’s a quick breakdown:

caesar :: Int -> String -> String

There are several ways to declare functions in Haskell and this one simply says that caesar is a function which takes an Int and a String and returns a String. In typical functional tradition we then map a function across out collection of characters:

      | inRange ('a','z') c = tr 'a' k c
      | inRange ('A','Z') c = tr 'A' k c
      | otherwise = c

This simply says that if the current item is a lowercase letter than translate it as such, ditto for upper case and for anything else (ie. space) then just return the unaltered version. This would look almost the same in Clojure.

Finally, we have the actual translation:

tr base offset char = chr $ ord base + 
                      (ord char - ord base + offset) `mod` 26

This is much like the Clojure version, except instead of constructing an infinite sequence the index is wrapped using modulo. The use of $ simply combines chr and ord much like & does in J and ord/chr equals int/char in the Clojure version.

J

A task such as this is one of the many examples where J really shines - Here goes:

cndx=: [: , 65 97 +/ 26 | (i.26)&+
caesar=: (cndx 0)}&a.@u:@cndx@[ {~ a.i.]

First, cndx is defined using the [: cap operator. The cap op lets us train several operations together following the rules outlined here. But since J evaluates from the right, we have to start there.

The rightmost operator is + which means “add to the argument”. In the definition of caesar we see that cndx is called with a single argument, a number, so in this case we’re adding to that number. The item we’re adding is (i.26), meaning a list of integers from 1 - 26. Now we can separate this part out and into the REPL:

   ((i.26)&+) 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
   ((i.26)&+) 10
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
 

As you can see, it simply displaces the offset. The next part adds the modulo operator for the wrapping:

   26 | ((i.26)&+) 10
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 0 1 2 3 4 5 6 7 8 9

In the next phase we come across one of the really cool features of an array-language, that we can add 2 integers to the entire rightside argument:

   65 97 +/ 26 | ((i.26)&+) 10
 75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 65 66 67  68  69  70  71  72  73  74
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 97 98 99 100 101 102 103 104 105 106

Its difficult to see when we’re not in the REPL, but the above is actually 2 lists of the length 26 characters each. We can read the dimension using the $ operator:

   $ 65 97 +/ 26 | ((i.26)&+) 10
2 26

Now we have a sequence of upper+/lower-case letters and all we need to do is explain the first two components: [: ,

The [: can be viewed as syntactic sugar as it simply allows to to compose a more concise statement with less parens. The comma has a significant effect in that it flattens (ravels) the list into a single dimension list:

   , 65 97 +/ 26 | ((i.26)&+) 10
75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 65 66 67 68 69 70 71 72 73 74 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 97 98 99 100 101 102 103 104 105 106
   $ , 65 97 +/ 26 | ((i.26)&+) 10
52

When starting to read tacit statements like cndx, I was recommended by J expert Tracy Harms to always start from the righthand side and to look for fork patterns. The pattern in cndx is:

cndxAlt=: vL3 vM3 (vL2 vM2 (vL1 vM1 vR1))

Here you see 3 forks with 3 components each (no other way) and broken down they’re:

 vL3=: [:
   vM3=: ,
   vL2=: 65 97"_
   vM2=: +/
   vL1=: 26"_
   vM1=: |
   vR1=: (i.26)&+

The [: prefix allows us to avoid the parens and the “_ are a different story.

I’ve walked you through the first part of the J solution pretty much like Tracy Harms walked me through it, but its beyond the scope of this evening to walk through caesar. If it turns out that more people than Tracy and I appreciate J, I’ll be happy to do it :)

These Google challenges are typically a great suit for functional languages as the point is often to arrive at an algorithm which you can apply to the problem at hand. When reviewing the solutions, its interesting to compare high/low level languages as well as imperative/funtional languages. After having solved the problem, I went to find other solutions on Google and here is one I came across written in Java: Java Solution

The Java Solution

When reading the solution I notice the following:

  • The developer spent 1.5 hours on it
  • A large part of the code is spent on reading/writing files
  • The solution itself is handwritten - the program does not learn from the data

The Problem

You can read the official explanation here: Challenge #1

The short version is this: Google provides you with an unreadable language for which you must develop a translator. In order to make the translator you get 3 lines of pre/post translated text as well as 3 character mappings. The task is then to discover the remaining 24 mappings.

Repl Play

Since we’re trying to produce mappings we need to construct a hash map where the keys are english characters and the values are the Googleresesian counterparts, so lets begin by loading the sample:

(def input ["ejp mysljylc kd kxveddknmc re jsicpdrysi"
            "rbcpc ypc rtcsra dkh wyfrepkym veddknkmkrkcd"
            "de kr kd eoya kw aej tysr re ujdr lkgc jv"])

(def google ["our language is impossible to understand"
             "there are twenty six factorial possibilities"
             "so it is okay if you want to just give up"])

In order to produce the initial mapping all we have to do is traverse each character in each string and store the relation:

(->> (map #(map (fn [i g] (hash-map i g)) %1 %2) input google)
     flatten
     (apply merge))
{\space \space, \a \y, \b \h, \c \e, \d \s, \e \o, \f \c, \g \v, \h \x, \i \d, \j \u, \k \i, \l \g, \m \l, \n \b, \o \k, \p \r, \r \t, \s \n, \t \w, \u \j, \v \p, \w \f, \x \m, \y \a}

The outer map walks the strings, the inner map walks the characters of each string, returning a hash-map with the key/val pair. All that’s needed is now to manually add the mapping given by Google in the explanation of the problem:

(def dict (assoc *1 \z \q))

The *1 is simply short-hand for “the last result of evaluation”, in this case the mappings. With the dictionary, dict, defined, we can look up simply by calling it as a function

(dict \a)
> \y

Here we see that an “a” translates to a “y”. To make a translate function all we have to do is call dict on every character and then return the resulting sequence of characters as a string:

(defn translate [in]
  (->> (map #(dict %) in) (apply str)))

Now all of the work is basically done and all we need to do is apply our translate function to the dataset and properly format our output:

(def input (-> "A-small-practice.in" reader line-seq rest))
> input

The input file to Google is cast to a reader then read as a sequence of lines, dismissing the first line which simply contains an integer which we don’t need.

With a sequence of strings in input we could simply run (map translate input) and that would produce the correct translation, however to properly format the output we need to add an index and some text:

(doseq [[idx txt] (map #(vector %1 %2) (range) s)] 
    (println (str "Case #" (inc idx) ": " (translate txt))))
Case #1: our language is impossible to understand
Case #2: there are twenty six factorial possibilities
Case #3: so it is okay if you want to just give up
Case #4: xoggk d yl wxo vkkvgo ekso uyl wonw sywy
Case #5: xkf yto akj
Case #6: yabba dabba yabba dabba yabba dabba yabba dabba yabba dabba yabba dabba yabba dabba yabba dooooooooo
Case #7: a b c d e f g h i j k l m n o p q r s t u v y w x y z now i know my abcs
Case #8: next time wont you sing with me
Case #9: for those who speak in a tongue do not speak to other people
Case #10: nobody understands them since they are speaking mysteries in the spirit
Case #11: this is so exciting i have to go the bathroom
Case #12: it was the best of times it was the blurst of times
Case #13: let lips do what hands do
Case #14: this here is gunpowder activated twenty seven caliber full auto no kickback nailthrowing mayhem
Case #15: i have bested fruit spike and moon now i shall best you the guy
Case #16: oh hai im in ur computer eating your cheezburgers and googleresing your textz
Case #17: an eye for an eye and a pigeon for a pigeon
Case #18: all your base are belong to error the spoony bard
Case #19: you pissed off the chicken lady
Case #20: now is the summer of our lack of discontent
Case #21: by the pricking of my thumbs something wicked this way comes
Case #22: in a world of direwolves and lions sometimes the rarest creature is a friend
Case #23: greetings cheese popsicle the number you have dialed is currently out of porkchops
Case #24: why do programmers always mix up halloween and christmas
Case #25: im commander shepard and this is my favorite problem on the google code jam
Case #26: f of two equals f of one equals one
Case #27: for i between three and n f of i equals f of i minus one plus f of i minus two
Case #28: how are you holding up because im a potato
Case #29: dr zaius dr zaius dr zaius dr zaius ooooooooooooh dr zaius
Case #30: whoooooooooooooooooooaaaaaaaaa i know c plus plus

Conclusion

Working in a high-level language allow you to express your thoughts very quickly and succintly. Over time, working with high-level languages, changes the way you think about problems allowing you more easily to see the bigger picture. To this end Clojure has been an excellent companion.

The above program took only a few minutes to write and never made it as far as a file, it was simply written in the repl while playing with the data.