It all started with SquidGuard and the need to block or allow websites. SquidGuard has done a great job for years, but it lacks a couple of things, and one of them is online database updates.
SquidBlocker is there for sysadmins whose service needs to stay up for very long periods of time without the need to restart or reload the DB server. One example is a hospital, where human lives are at stake and shutting down the service might lead to unwanted situations.
#!/usr/bin/env bash
# Add the NgTech repo that ships the squidblocker package
cat > /etc/yum.repos.d/squid.repo <<EOF
[squid]
name=Squid repo for CentOS Linux
#IL mirror
baseurl=http://www1.ngtech.co.il/repo/centos/\$releasever/\$basearch/
failovermethod=priority
enabled=1
gpgcheck=0
EOF
yum update -y
yum install httpd php squidblocker -y
# Allow .htaccess overrides and point the DocumentRoot at the squidblocker dir
sed -i -e 's/AllowOverride\ None/AllowOverride All/g' /etc/httpd/conf/httpd.conf
sed -i -e 's/AllowOverride\ none/AllowOverride All/g' /etc/httpd/conf/httpd.conf
sed -i -e 's@DocumentRoot\ \"\/var\/www\/html\"@DocumentRoot "/var/www/squidblocker"@g' /etc/httpd/conf/httpd.conf
# Expose the block page under the new DocumentRoot
ln -s /var/www/block_page /var/www/squidblocker/block_page
# Enable and start the web server and the SquidBlocker DB server (sbserver)
systemctl enable httpd
systemctl enable sbserver
systemctl restart httpd
systemctl start sbserver
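Once both services are up, the DB server should be answering on port 8080. A quick sanity check against the /sb/01/url interface (the same one the hammer's -test flag points at further down) might look like this:
curl "http://127.0.0.1:8080/sb/01/url/?url=http://www.example.com/"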
Since HA and LB are part of high uptime, I wrote a small reverse proxy to allow mirroring updates and changes across multiple DB hosts. You can configure the systemd script with a comma-separated string which declares the DB hosts. The server never returns any information about the success of the action, but there is stdout output which will be thrown if there is an issue contacting any of the peers. It can also work as a PURGE hub that sends to multiple hosts when a key update is being done.
File "/usr/sbin/sblocker_http_hub"
Systemd service: sbhub
Since SquidBlocker is an HTTP server, there is an option to use squid, varnish or nginx in front of it as a reverse caching proxy. The hub can help to send PURGE, PUT or SET messages to multiple hosts.
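For example, pushing a single key update through the hub so it gets mirrored to every declared peer could look like the line below. The hub's listen port (8081 here) and passing the key as a query term are assumptions for the sake of the example, not documented defaults:
# assumed hub port and key/val query terms; adjust to your sbhub setup
curl -X PUT "http://127.0.0.1:8081/db/put/?key=dom:example.com&val=1"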
I am providing a squid external ACL helper that supports concurrency and works with the following settings:
external_acl_type filter_url ipv4 concurrency=50 ttl=3 %URI /usr/bin/sblocker_client -http=http://127.0.0.1:8080/sb/01
acl filter_url_acl external filter_url
deny_info http://<SERVER_NAME>/block_page/?url=%u&dom=%H filter_url_acl
acl localnet src 192.168.0.0/16
http_access deny !filter_url_acl
http_access allow localnet filter_url_acl
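Since the helper reads squid's external ACL protocol on stdin, it can also be tested by hand outside of squid. Assuming it follows the standard concurrent helper format (a channel ID followed by the %URI token) and answers with "ID OK" or "ID ERR", a manual check would be something like:
# assumed stdin format "<channel-id> <url>"; a reply of "0 OK" or "0 ERR" is expected
echo "0 http://www.example.com/" | /usr/bin/sblocker_client -http=http://127.0.0.1:8080/sb/01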
SquidBlocker "Hammer" run batch updates to the DB. It was written due to the fact that it runs updates 5% of the time that of a single request per update. It uses a "PUT" request to black or whitelist a list of domains urls or tcp_ip_port list. An example of usage:
/usr/bin/sblocker_hammer -f="BL/porn/domains" -http="http://127.0.0.1:8080/db/put_batch/?val=1&prefix=dom:" -test="http://127.0.0.1:8080/sb/01/url/?url=" -t="http://block.test.example.com/?test1=1"
/usr/bin/sblocker_hammer -f="BL/porn/urls" -http="http://127.0.0.1:8080/db/put_batch/?val=1&prefix=url:http://" -test="http://127.0.0.1:8080/sb/01/url/?url=" -t="http://block.test.example.com/?test1=1"
SquidGuard is an ACL helper that uses BDB files and the "key" to block or allow a domain or a URL path (without query terms). It runs a series of lookups, in many cases against a bunch (10+) of BDB files.
A little note about BDB: it is an embedded database for key/value data, which means that for a programmer it is a library that lets you write code that works with files in a specific format through a specific API. The last time I tried the Ruby and other language APIs, they were based on the C binding and didn't allow me to run write operations concurrently. As I understand it, it works with a lock mechanism which should allow some concurrency for reads.
Who is using it? Many!! (OpenLDAP, RPM, memcachedb.) The size limit of a DB file is from 2TB to 256TB and depends on the DB page size.
SquidGuard queries the DB and caches the results for a specific amount of time.
SquidGuard's basic tests are done first on domains and then on the path. In the domain DB, the presence of an upper-level domain is a match. It means that if the DB contains gambit.com then www.gambit.com, gambit.com and www1.gambit.com will all be a match, but testgambit.com will not. For the path tests, if there is a "www." at the start of the host, SquidGuard first strips it and then runs a series of tests against the DB. The URL lookup tests for the full file path and then recursively walks the path backwards until it gets to the root path and stops. Since the host is always tested with the "www." prefix stripped, URLs in the DB that start with "www." are meaningless.
An example would be "http://www.example.com/test1/1.jpg?arg=1". SquidGuard will first convert it to "example.com/test1/1.jpg" and then test for a full match, which is "example.com/test1/1.jpg", and then walk backwards towards the root path to see if there is a blacklist match, resulting in the lookups "example.com/test1/" and later "example.com/". If the URL did not contain "www." but another host, such as "http://test.example.com/test1/1.jpg?arg=1", it would be tested for "test.example.com/test1/1.jpg" and backwards for "test.example.com/test1/" and "test.example.com/". So: no port, no scheme, no "www." and no query terms in the URL DB. I do not know how SquidGuard handles regex lists, so they are out of the scope of this doc.
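To make the walk concrete, here is a minimal shell sketch of that reverse path lookup. It is not SquidGuard's actual code: lookup_db is a stand-in for the real BDB query (just a grep against a plain-text list), and the input is assumed to already have the scheme, port, "www." and query terms stripped as described above.
#!/usr/bin/env bash
# Stand-in for the BDB lookup: exact-line match against a plain-text blacklist.
lookup_db() { grep -qxF "$1" "${BLACKLIST:-/tmp/urls.txt}"; }

walk() {
    local key="$1"                    # e.g. "example.com/test1/1.jpg"
    while :; do
        lookup_db "$key" && { echo "match: $key"; return 0; }
        local trimmed="${key%/}"      # drop a trailing "/" if present
        # no "/" left means the root path was already tested, so stop
        [ "$trimmed" = "${trimmed%/*}" ] && break
        key="${trimmed%/*}/"          # drop the last path element, keep the "/"
    done
    echo "no match"
    return 1
}

# Tries "example.com/test1/1.jpg", then "example.com/test1/", then "example.com/"
walk "example.com/test1/1.jpg"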
Cisco uses a similar approach in their Meraki filtering products.
SquidGuard tries to be as efficient and fast as possible for blacklists and therefore doesn't take many things into account. Since it is not using the scheme, port and query, there are a couple of scenarios which it cannot handle well. There are places which require more than just a blacklist system and need a system which will match full URLs for a whitelist or blacklist. An algorithm for these cases requires another lookup approach, which I have implemented. The algorithm is a "PATH reverse lookup only", which means that it will look up the full URL (with query terms, full host and port) and then test backwards towards the root path of the URL. For example, "http://www.example.com/test1/1.jpg?arg=1" will be tested (first match wins) for:
http://www.example.com/test1/1.jpg?arg=1
http://www.example.com/test1/1.jpg
http://www.example.com/test1/
http://www.example.com/
And in case a port is present it will not be stripped. An example for a port-present case would be the URL "http://www.example.com:8080/test1/1.jpg?arg=1", which would be tested for:
http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg
http://www.example.com:8080/test1/
http://www.example.com:8080/
The leading slash "/" is an issue by itself since it is both used and required. The path lookup relies on there always being a root path "/" present in the URI, as this is its structure. It also takes into account that there is a full-match test before reverse-testing the path. It means that to match a full path, the leading "/" in the path needs to be stripped. This affects the way we store URLs in the DB.
Whenever a device on the network accesses a web page, the requested URL is checked against the configured lists to determine if the request will be allowed or blocked.
Pattern matching follows these steps in order (first a full-URL match, then the reverse path walk, and finally the domain lookup):
If any of the above steps produces a match, then the request will be blocked or whitelisted as appropriate. The whitelist always takes precedence over the blacklist, so a request that matches both lists will be allowed. If there is no match, the request is subject to the category filtering settings above.
To allow more flexibility for stricter environments, there is a way to run the lookup for a full match first, then recursively test the path, and when reaching the domain, test the domain against the domains blacklist. Such a lookup path for "http://www.example.com:8080/test1/1.jpg?arg=1" would look like this:
http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg
http://www.example.com:8080/test1/
http://www.example.com:8080/
and finally the domain itself against the domains blacklist.
It's a "slow" lookup but it is the most resilience algorithms.
For IP based hosts there is only one lookup in the domains blacklist.
The CONNECT method for tunneling connections uses a destination IP or domain + port, which is another side of the algorithm. SquidGuard takes into account only the domain/IP level of the issue and therefore doesn't fit many environments in which the proxy needs to be more resilient. Such a proxy needs to either block all and allow from a list, or allow all and block very specific TCP services; with SquidGuard there is no way to allow or block only some of a host's IP services. This is basically more of a firewall-level issue, but a squid proxy needs it. The number of checks is two: one for the ip:port (or domain:port) pair, and one for the bare domain or IP against the domains list.
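A hypothetical squid.conf sketch for that CONNECT case, mirroring the helper line above but pointing it at the /sb/01/tcp interface (the filter_tcp names are made up for this example):
external_acl_type filter_tcp ipv4 concurrency=50 ttl=3 %URI /usr/bin/sblocker_client -http=http://127.0.0.1:8080/sb/01/tcp
acl filter_tcp_acl external filter_tcp
acl CONNECT method CONNECT
http_access deny CONNECT !filter_tcp_acl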
SquidGuard uses two different lists with different characteristics in mind. To match SquidGuard's lists logic I am using two schemes: dom: for domains (e.g. the key dom:example.com) and url: for full URLs (e.g. url:http://example.com/test1/1.jpg).
* Things to be done:
- I need to change it so that the base64 decoding will be enabled using a "base64=X" query_term instead of direct interfaces.
- Database structure:
- using the dom: scheme to keep domains apart from URLs and other variables
- using the url: scheme to keep URLs apart from the main DB and domain variables
- using user_weight: to store each user allowed weight
- using group_weight: to store each group allowed weight
- Tests to run
- Test urls:
http://www.example.com/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/test2
http://www.example.com:8080/test1/test2/
tcp:ip:port
[X] There is a need to test if squid sends port
- it sends ip:port or domain:port
- Change to use "net.SplitHostPort" instead of manual parsing, and use the error to set the host and port.
[X] NEED to test hammer7
[X] Squid Conf example for the client
* A list of http interfaces
[ ] /control (auth)(auth can be done using a reverse proxy)
[X] /control/dunno_mode
[X] /control/dunno_filp
[X] /control/db_stop
[X] /control/db_start
[X] /control/db_status
- UI path
[ ] /ui/
- SquidGuard search algorithm as if there is one big blocklist with both domains and paths
[X] /sg/domain
[ ] /sg/path_only
[ ] /sg/url
[ ] /sg/url_01
[ ] /meraki/url (one big blacklist)
[ ] /meraki/url_01(mixed white and blacklist)
* Need to fix the issues with a trailing "/" for a couple of test cases
[X] /sb/01/url (needs a better testing)
[X] /sb/01/tcp
[ ] /sb/url_nolist
[ ] /sb/url_nolist_01
[ ] /sb/urlwithdomlist_path
[ ] /sb/urlwithdomlist_path_01
[ ] /sb/dom_bl_01
[ ] /sb/tcp_ip_port_01 (uses the "ip:port" format or can be used with a CONNECT scheme from a uri and the DB storage can be tcp:ip:port)
[ ] /sb/safe_search_force/url/
[ ] /sb/weight_url_by_bar
[ ] /sb/weight_url_by_user (bar stored in the db under user_weight: prefix)
[ ] /sb/weight_url_by_group (var bar stored in the db under group_weight: prefix)
[ ] /sb/weight_dom_by_bar
[ ] /sb/weight_dom_by_user (bar stored in the db under user_rate: prefix)
[ ] /sb/weight_dom_by_group (var bar stored in the db under group_rate: prefix)
[ ] /sb/first_match_path1
[ ] /sb/first_match_path2
[ ] /sb/youtube/id/ (block or allow by video or image id)
[ ] /sb/youtube/user/
- The /db/get* interfaces take the key as a query term
[X] /db/get (by key),(url.Unescape by default)
[X] /db/get_base64 (by key)
- all the /db/put* interfaces take the query terms val and prefix, and need debug sections.
[X] /db/put (url.Unescape by default)
[X] /db/put_batch (with and without prefix)
[X] /db/set (url.Unescape by default)
[ ] /db/set_dom (url.Unescape by default)
[ ] /db/set_user (url.Unescape by default)
[ ] /db/set_url
[ ] /db/set_uri
[X] /db/set_base64
[X] /db/del (url.Unescape by default)
[X] /db/del_batch (is there prefix compatibility?)
[X] /db/del_base64
[ ] /db/distribute_key/tohost
[ ] /db/fullsync/tohost (should be done with rsync on dunno mode)
[ ] /db/peers
- Tests to run (a curl sketch follows after this list)
/sg/X
[X] domain
[ ] /sg/path_only_base64
/db/X
[ ] insert key + val using (put or set)
[ ] fetch key (using get)
[ ] del key
[ ] put batch
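Here is a curl sketch of those /db tests, assuming key and val are passed as query terms (per the "by key" and val/prefix notes above) and that put_batch takes the list in a PUT body the way the hammer sends it:
# insert, fetch and delete a single key (query-term names assumed)
curl "http://127.0.0.1:8080/db/put/?key=dom:example.com&val=1"
curl "http://127.0.0.1:8080/db/get/?key=dom:example.com"
curl "http://127.0.0.1:8080/db/del/?key=dom:example.com"
# batch insert: one domain per line in domains.txt, stored with the dom: prefix
curl -X PUT --data-binary @domains.txt "http://127.0.0.1:8080/db/put_batch/?val=1&prefix=dom:"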
Copyright (c) 2015, Eliezer Croitoru All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.