Squid BlackLister - blueprint\idea

Why?

It all started with SquidGuard and the need to block websites. SquidGuard has done a great job for years, but it lacks one or two things, and one of them is an "online database update". SquidGuard runs under the url_rewrite interface of squid as an external helper and doesn't implement concurrency.

Due to this nature, SquidGuard requires a restart of squid, and therefore downtime, for each DB update, and it also requires a huge amount of workers running in parallel on a busy system to meet request times.

A bit more about SquidGuard

SquidGuard is an ACL helper that uses a domain, a url or a url path as the "key" to block or allow a request. It runs a series of lookups, in many cases against a couple of BDB files.

A little note about BDB: it is an embedded database for key/value data. For a programmer it is a library that lets you write code against files in a specific format through its API. The last time I tried the ruby and other languages' APIs they were based on the C binding and didn't allow me to run write operations concurrently. I understood it works with a lock mechanism which should allow some concurrency for reads.
Who is using it? Many!! (openLDAP, RPM, memcachedb) The size limit of a DB file is from 2TB to 256TB and depends on the DB page size.

How does SquidGuard work?

SquidGuard communication with squid

SquidGuard uses the url_rewrite_program interface of squid. A nice thing to know is how it works. Squid emulates stdin, stdout and stderr for the running helper and communicates with it over them. Squid sends the helper a line such as "http://www.squid-cache.org/Images/img7.gif 192.168.10.131/192.168.10.131 - GET myip=192.168.10.194 myport=3128" and expects to receive a line back. The answer squid expects (3.5.7) is either "ERR" or "OK status=N url=X" or "OK rewrite-url=X". ERR means no change. The others are either a redirect with a status of 302 and similar, or a transparent rewrite of the url. The way SquidGuard blocks traffic is by transparently rewriting the url if the ACL to block was met, or by responding with "ERR" if no blocking is needed.
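
To make the exchange concrete, here is a minimal sketch of such a helper in Go (a sketch only; the blocked domain and the rewrite target are made-up placeholders, not anything taken from SquidGuard):

    // A sketch of a url_rewrite_program helper: read one request line from
    // stdin, answer with one line on stdout.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        in := bufio.NewScanner(os.Stdin)
        out := bufio.NewWriter(os.Stdout)
        for in.Scan() {
            // e.g. "http://www.squid-cache.org/Images/img7.gif 192.168.10.131/192.168.10.131 - GET myip=192.168.10.194 myport=3128"
            fields := strings.Fields(in.Text())
            if len(fields) == 0 {
                continue
            }
            if strings.Contains(fields[0], "blocked.example.com") {
                // transparent rewrite to a local "blocked" page
                fmt.Fprintln(out, "OK rewrite-url=http://127.0.0.1/blocked.html")
            } else {
                fmt.Fprintln(out, "ERR") // ERR means no change
            }
            out.Flush()
        }
    }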

SquidGuard list storage and ACL testing

SquidGuard uses BDB files for blacklist storage and tests these files for each request, usually more than once. It means that if I have a rule that blocks all gambling domains but allows all sales domains, it will run a couple of lookups on these DB files. I do not know the exact order, but basically for the url "http://testdomain.example.com/test_path/1/2/3/file1.jpg?uid=xyz" it can run 2-3 tests for the domain plus 5 or 6 for the url, per DB file. Each of the DB files holds either domain names (including IPs) or a url list. SquidGuard allows a blocked domain to be defined either by a higher level domain name or by an exact one, and a similar approach is used for urls: an exact match of the file path or of a higher level path. There are things that I know about SquidGuard, but reading the code was not part of how I understood it, and I don't know how SquidGuard handles regex lists. I asked others and ran a couple of traces, so I might be off by a bit or two.

SquidGuard updates process

Since SquidGuard uses BDB files it cannot use them for both reads and writes at the same time, or at least not in a simple enough way. Because it's not simple, if at all possible, to open the DB files for both reads and writes at the same time, SquidGuard requires the admin to shut down the SquidGuard processes each time an update to the DB files is required. SquidGuard runs as a process under squid, and each process is independent in memory handling and CPU time; it's nice to have multiple processes that might utilize the cores better than a single one, but there is an issue with it. Each process reads the DB, or rather the BDB library reads the DB and saves all sorts of things in memory. In debian\ubuntu there is a small script that upgrades the DB and then runs "force-reload" on squid. This method is the most elegant but any

SquidGuard and concurrency

Squid gives a helper a way to handle multiple requests at the same time, using a "channel id". As I mentioned before, squid communicates with the helper over emulated STDIN, STDOUT and STDERR and sends it a line per request. The difference is that instead of sending one line and waiting for a response, squid can send the helper a bunch of requests and let it handle them all at its own pace. If the helper implements concurrent processing of the requests, squid will identify each request by a number while communicating with the helper. And now for the examples: squid sends the helper "0 http://www.squid-cache.org/Images/img7.gif 192.168.10.131/192.168.10.131 - GET myip=192.168.10.194 myport=3128" and the helper responds with "0 ERR" when it's ready to answer. So squid can send a bunch of requests and expects a response carrying the same number at any later time (with some time limits). SquidGuard doesn't support concurrency, whether because it is extremely fast or for some other reason.
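
A sketch of what a concurrency-aware helper could look like (the channel-id handling is the whole point here; the lookup itself is just a placeholder):

    // A sketch of a concurrency-aware helper: every request line starts with a
    // channel id and the answer must carry the same id back, so answers may be
    // written in any order.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
        "sync"
    )

    func lookup(url string) string {
        // placeholder for the real blacklist lookup
        return "ERR"
    }

    func main() {
        in := bufio.NewScanner(os.Stdin)
        var mu sync.Mutex // serialize writes to stdout
        for in.Scan() {
            parts := strings.SplitN(in.Text(), " ", 2)
            if len(parts) < 2 {
                continue
            }
            id, rest := parts[0], parts[1]
            go func(id, rest string) {
                answer := "ERR"
                if f := strings.Fields(rest); len(f) > 0 {
                    answer = lookup(f[0])
                }
                mu.Lock()
                fmt.Printf("%s %s\n", id, answer) // e.g. "0 ERR"
                mu.Unlock()
            }(id, rest)
        }
    }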

multiple times opening the same file

ONLINE updates

Realities of blacklist categorizing

SquidGuard username ACLs handling

It queries the DB and saves the credentials for a specific amount of time.

URL Testing algorithms

SquidGuard

SquidGuard's basic tests are done first on domains and then on the path. In the domain DB, if an upper level domain is present it's a match. It means that if the DB contains gambit.com then www.gambit.com, gambit.com and www1.gambit.com will all be a match, but testgambit.com will not. For the path, if there is a "www." in the url SquidGuard first strips it and then runs a series of tests against the DB. The url lookup tests for the full file path and then walks backwards through the path until it gets to the root path and stops. The path lookup divides into two cases, with "www." at the start (closer to the scheme) of the domain name or without it, which means that urls in the DB that start with "www." are meaningless.

An example would be "http://www.example.com/test1/1.jpg?arg=1". SquidGuard first converts it to "example.com/test1/1.jpg" and tests for a full match, "example.com/test1/1.jpg", then walks backwards towards the root path to see if there is a match for blacklisting, resulting in the next lookup "example.com/test1/" and later "example.com/". If the url did not contain "www." but another domain, such as "http://test.example.com/test1/1.jpg?arg=1", it would be tested for "test.example.com/test1/1.jpg" and backwards "test.example.com/test1/", "test.example.com/". So no port, no scheme, no "www." and no query terms in the url DB. I do not know how SquidGuard handles regex lists, so that is out of the scope of this doc.
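
To make the lookup order concrete, here is a small sketch that generates the domain keys and the url keys described above (my reconstruction of the behaviour, not SquidGuard's code; stopping the domain walk before the bare TLD is an assumption):

    // A sketch of the SquidGuard-style lookup keys for one request: domain
    // keys are the host and its parent domains, url keys are "host/path"
    // without scheme, port, query or a leading "www.", walking back towards
    // the root path.
    package main

    import (
        "fmt"
        "net/url"
        "strings"
    )

    func domainKeys(host string) []string {
        labels := strings.Split(host, ".")
        var keys []string
        for i := 0; i < len(labels)-1; i++ {
            keys = append(keys, strings.Join(labels[i:], "."))
        }
        return keys
    }

    func urlKeys(host, path string) []string {
        host = strings.TrimPrefix(host, "www.")
        keys := []string{host + path} // full match first
        for p := path; p != "/" && p != ""; {
            p = p[:strings.LastIndex(strings.TrimSuffix(p, "/"), "/")+1] // drop the last element
            keys = append(keys, host+p)
        }
        return keys
    }

    func main() {
        u, _ := url.Parse("http://www.example.com/test1/1.jpg?arg=1")
        fmt.Println(domainKeys(u.Hostname()))
        // [www.example.com example.com]
        fmt.Println(urlKeys(u.Hostname(), u.Path))
        // [example.com/test1/1.jpg example.com/test1/ example.com/]
    }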

Cisco uses a similar approach in their Meraki filtering products.

Path lookup alternative.

SquidGuard tries to be as effective and fast as possible for blacklists and therefore doesn't take many things into account. Since it is not using the scheme, port and query, there are a couple of scenarios which it cannot handle well. There are places which require more than just a blacklist system and need a system which will match full urls for a whitelist or blacklist. An algorithm for these cases requires another lookup approach, which I have implemented. The algorithm is a "PATH reverse lookup only", which means that it will look up the full url (with query terms, full host and port) and then test backwards towards the root path of the url. For example "http://www.example.com/test1/1.jpg?arg=1" will be tested (first match) for:
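
(my reconstruction from the description above, not output taken from the code)

    http://www.example.com/test1/1.jpg?arg=1
    http://www.example.com/test1/1.jpg
    http://www.example.com/test1/
    http://www.example.com/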

And in case a port is present it will not be stripped. An example for a port-present case would be the url "http://www.example.com:8080/test1/1.jpg?arg=1", which would be tested for:
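
(again a reconstruction; the port is kept in every key since it is not stripped)

    http://www.example.com:8080/test1/1.jpg?arg=1
    http://www.example.com:8080/test1/1.jpg
    http://www.example.com:8080/test1/
    http://www.example.com:8080/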

The leading slash "/" is an issue by itself since it's being used and is required. The path lookup is based on the fact that there is always a root path "/" present in the uri, as this is its structure. It also takes into account that there is a full match test before reverse testing the path. It means that to match a full path, the leading "/" in the path has to be stripped. This affects the way we store urls in the DB.
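
A sketch of how such a lookup sequence could be generated (an illustration of the algorithm described in this section, not the SquidBlocker implementation; how the leading "/" is handled on the storage side is left out):

    // A sketch of the "PATH reverse lookup only" idea: test the full url
    // (scheme, host, port and query included) first, then walk back towards
    // the root path.
    package main

    import (
        "fmt"
        "net/url"
        "strings"
    )

    func reverseLookupKeys(raw string) ([]string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return nil, err
        }
        base := u.Scheme + "://" + u.Host // Host keeps the port when one was given
        var keys []string
        if u.RawQuery != "" {
            keys = append(keys, base+u.Path+"?"+u.RawQuery) // exact match, query included
        }
        p := u.Path
        if p == "" {
            p = "/"
        }
        keys = append(keys, base+p)
        for p != "/" {
            i := strings.LastIndex(strings.TrimSuffix(p, "/"), "/")
            if i < 0 {
                break
            }
            p = p[:i+1] // drop the last path element
            keys = append(keys, base+p)
        }
        return keys, nil
    }

    func main() {
        keys, _ := reverseLookupKeys("http://www.example.com:8080/test1/1.jpg?arg=1")
        for _, k := range keys {
            fmt.Println(k)
        }
        // http://www.example.com:8080/test1/1.jpg?arg=1
        // http://www.example.com:8080/test1/1.jpg
        // http://www.example.com:8080/test1/
        // http://www.example.com:8080/
    }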

Path + domain blacklist lookup alternative.

To allow more flexibility for more strict environments, there is a way to run the lookup for a full match first, then recursively test backwards, and when reaching the domain, test the domains blacklist. Such a lookup path for "http://www.example.com:8080/test1/1.jpg?arg=1" would look like this:
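
(a reconstruction; the exact form of the domain keys at the end is an assumption)

    http://www.example.com:8080/test1/1.jpg?arg=1
    http://www.example.com:8080/test1/1.jpg
    http://www.example.com:8080/test1/
    http://www.example.com:8080/
    www.example.com
    example.com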

It's a slow lookup but it is the most resilient of the algorithms.

Meraki Algorithm (from their docs)

Whenever a device on the network accesses a web page, the requested URL is checked against the configured lists to determine if the request will be allowed or blocked.

Pattern matching follows these steps in order:

  1. Try to match the full URL against either list (blocked vs whitelisted patterns list)
  2. Remove the protocol and leading "www" from the URL, and check again:
  3. Remove any "parameters" (everything following a question mark) and check again:
  4. Remove paths one by one, and check each:
  5. Cut off subdomains one by one and check again:
  6. Finally, check for the special catch-all wildcard, *, in either list.

If any of the above steps produces a match, then the request will be blocked or whitelisted as appropriate. The whitelist always takes precedence over the blacklist, so a request that matches both lists will be allowed. If there is no match, the request is subject to the category filtering settings above.
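
A sketch of the candidate patterns this order produces for one URL (written from the quoted steps only, not from any Meraki code; whether the port is kept in the stripped forms is a guess):

    // A sketch that turns one URL into the ordered list of patterns the
    // quoted steps would test.
    package main

    import (
        "fmt"
        "strings"
    )

    func merakiCandidates(raw string) []string {
        keys := []string{raw} // 1. the full URL as given

        // 2. remove the protocol and a leading "www."
        s := raw
        if i := strings.Index(s, "://"); i >= 0 {
            s = s[i+3:]
        }
        s = strings.TrimPrefix(s, "www.")
        keys = append(keys, s)

        // 3. remove the "parameters" (everything following a question mark)
        if i := strings.Index(s, "?"); i >= 0 {
            s = s[:i]
            keys = append(keys, s)
        }

        // 4. remove paths one by one
        for strings.Contains(s, "/") {
            s = s[:strings.LastIndex(s, "/")]
            keys = append(keys, s)
        }

        // 5. cut off subdomains one by one
        for strings.Count(s, ".") > 1 {
            s = s[strings.Index(s, ".")+1:]
            keys = append(keys, s)
        }

        // 6. the special catch-all wildcard
        return append(keys, "*")
    }

    func main() {
        for _, k := range merakiCandidates("http://www.example.com:8080/test1/1.jpg?arg=1") {
            fmt.Println(k)
        }
    }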

tcp IP + port

The CONNECT method for tunneling connections uses a destination IP + port, which is another side of the algorithms. SquidGuard takes into account only the domain\IP level of the issue and therefore doesn't fit many environments in which the proxy needs to be more resilient. It needs to either block all and allow from a list, or allow all and block very specific tcp services, which means there is no way to allow or block only some of the IP services. This is basically a firewall level issue, but a proxy needs it.

DB formats

SquidGuard

Path

SquidBlocker interfaces

- I need to change it so that the base64 decoding will be enabled using a "base64=X" query_term instead of direct interfaces.

- Database structure (example keys after this list):
 - using the dom: prefix to tell domains apart from the urls and the domain variable
 - using the url: prefix to tell urls apart from the main db and the domain variable
 - using the uri: prefix to tell urls with a scheme apart from the main db, domain and urls lists
 - using the tcp: prefix to tell tcp ip+port with a scheme apart from the main db, domain and urls lists
 - using the udp: prefix to tell udp ip+port with a scheme apart from the main db, domain and urls lists
 - using the user_weight: prefix to store each user's allowed weight
 - using the group_weight: prefix to store each group's allowed weight
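Example keys under this prefix scheme (illustrations only; the exact key and value formats are not decided here):

    dom:example.com
    url:example.com/test1/1.jpg
    uri:http://www.example.com:8080/test1/1.jpg
    tcp:192.0.2.1:443
    user_weight:user1
    group_weight:group1
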
[ ] /control (auth)
[X] /control/dunno_mode 
[X] /control/dunno_filp
[X] /control/db_stop
[X] /control/db_start
[X] /control/db_status

- UI path
[ ] /ui/

 - SquidGuard search algorithm as if there is one big blocklist with both domains and paths
[X] /sg/domain 
[ ] /sg/path_only
[ ] /sg/url
[ ] /sg/url_01

[ ] /meraki/url (one big blacklist)
[ ] /meraki/url_01 (mixed white and blacklist)

 * Need to fix the issues with the trailing "/" for a couple of test cases
[X] /sb/01/url (needs better testing)
[X] /sb/01/tcp
[ ] /sb/url_nolist
[ ] /sb/url_nolist_01
[ ] /sb/urlwithdomlist_path
[ ] /sb/urlwithdomlist_path_01
[ ] /sb/dom_bl_01
[ ] /sb/tcp_ip_port_01 (uses the "ip:port" format or can be used with a CONNECT scheme from a uri and the DB storage can be tcp:ip:port)
[ ] /sb/safe_search_force/url/

[ ] /sb/weight_url_by_bar
[ ] /sb/weight_url_by_user (bar stored in the db under user_weight: prefix)
[ ] /sb/weight_url_by_group (var bar stored in the db under group_weight: prefix)
[ ] /sb/weight_dom_by_bar
[ ] /sb/weight_dom_by_user (bar stored in the db under user_rate: prefix)
[ ] /sb/weight_dom_by_group (var bar stored in the db under group_rate: prefix)

[ ] /sb/first_match_path1
[ ] /sb/first_match_path2
[ ] /sb/youtube/id/ (block or allow by video or image id)
[ ] /sb/youtube/user/


- The /db/get* interfaces take a query term "key"
[X] /db/get (by key),(url.Unescape by default)
[X] /db/get_base64 (by key)

- All the /db/put* interfaces take the query terms "val" and "prefix", and need debug sections.
[X] /db/put (url.Unescape by default)
[X] /db/put_batch (with and without prefix)

[X] /db/set (url.Unescape by default)
[ ] /db/set_dom (url.Unescape by default)
[ ] /db/set_user (url.Unescape by default)
[ ] /db/set_url
[ ] /db/set_uri
[X] /db/set_base64

[X] /db/del (url.Unescape by default)
[X] /db/del_batch (is there prefix compatibility?)
[X] /db/del_base64
[ ] /db/replicate_key/tohost
[ ] /db/fullsync/tohost (should be done with rsync on dunno mode)
[ ] /db/peers

- Tests to run
/sg/X
[X]domain 
[ ]/sg/path_only_base64 

/db/X
[ ]insert key + val using(put or set)
[ ]fetch key (using get)
[ ]del key
[ ]put batch




- Test urls:
http://www.example.com/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/test2
http://www.example.com:8080/test1/test2/
tcp:ip:port

- There is a need to test whether squid sends the port

- Change to use "net.SplitHostPort" instead of manually parsing, and based on the error set the host and port.
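
A sketch of that change (the fallback port of 80 when none is present is only an example):

    // A sketch of using net.SplitHostPort instead of manual parsing; when the
    // error says there is no port, fall back to the whole input as the host.
    package main

    import (
        "fmt"
        "net"
    )

    func hostPort(hp string) (string, string) {
        host, port, err := net.SplitHostPort(hp)
        if err != nil {
            // e.g. "missing port in address": keep the input as the host
            return hp, "80"
        }
        return host, port
    }

    func main() {
        fmt.Println(hostPort("www.example.com:8080")) // www.example.com 8080
        fmt.Println(hostPort("www.example.com"))      // www.example.com 80
    }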

- NEED to test hammer7

Who is SquidBlocker meant for?

SquidBlocker is there for sysadmins that want to keep the service up without needing to restart or reload the server for a very long period of time. One example is a hospital, where human lives are at stake.

SquidBlocker hub\broadcaster

A purge hub for varnish and squid that sends multiple PUT and/or GET requests.