This document describes the tasks and the assumed behavior of the dCache MultiProtocolPool (v2). Moreover, it briefly describes how to configure and manage pools from the administrator's point of view.
The MultiProtocolPool (v2) essentially
- moves data between the cache and the clients as well as from the HSM to the cache and vice versa.
- manages the free and occupied space on disk and makes space available for newly arriving data.
The 'restore handler' is a subsystem of the PoolManager cell. It allows the administrator to view, retry and cancel ongoing restore processes.

(local) admin > cd PoolManager
(PoolManager) admin > rc ls

rc ls displays a list of all restore processes. Format :

0033000000000000006C3928@131.169.0.0/255.255.0.0 m=1 [0] [zeus-doener-14] [Staging 08.25 22:52:41] {0,}
<pnfsId>@<netMask> m=<# of clients> [<retries>] [<pool>] [<Status>] {<errorCode>:<ErrorMessage>}
In case a request has a non-zero error status, the rc commands can cancel the request or retry it.

rc retry <pnfsId>@<netMask>
The specified request will be retried. Retrying a restore request only makes sense if the reason for the failure has been resolved. In most cases the failure has been caused by the HSM. Before a request goes into an error status, it is retried automatically using different pools (if available).

rc failed <pnfsId>@<netMask> [<errorCode> [<errorMessage>]]

rc failed cancels the request and (if specified) sends the error code and the error message to the requesting client.
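For illustration, a session could look as follows; the pnfsId, netmask, retry count and error values are made up for this sketch:

(PoolManager) admin > rc ls
0033000000000000006C3928@131.169.0.0/255.255.0.0 m=1 [2] [zeus-doener-14] [Waiting 08.25 23:10:02] {101:HsmTimeout}
#
# after the HSM problem has been fixed, retry the request :
#
(PoolManager) admin > rc retry 0033000000000000006C3928@131.169.0.0/255.255.0.0
#
# or give up and propagate an error to the requesting client :
#
(PoolManager) admin > rc failed 0033000000000000006C3928@131.169.0.0/255.255.0.0 33 HsmUnavailable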
Create the data and control directories :

cd <Pool-Base-Directory>
mkdir control data

Set up the 'setup' file. Create the 'setup' file in the <Pool-Base-Directory> directory and adjust the variables below according to your needs.

set max diskspace <PoolSize>

<PoolSize> : Defines the size of this pool. The postfixes k,m,g can be used to specify 1024, 1024*1024 and 1024*1024*1024 bytes respectively. As a rule of thumb :

set max diskspace (<Kbytes from 'df -k'> / 1024 / 1024 - 5 )g
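As a worked example with made-up numbers: if 'df -k' reports 524288000 Kbytes for the pool partition, then 524288000 / 1024 / 1024 equals 500 Gbytes, so the rule of thumb suggests:

set max diskspace 495g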
set heartbeat 120

Setting the heartbeat timer (in seconds).

mover set max active <maxNumberOfClientMovers>
<maxNumberOfClientMovers> : Maximum number of movers talking to clients. If this number of movers is exceeded, the additional ones are queued.

rh set max active <maxNumberOfRestoreMovers>
<maxNumberOfRestoreMovers> : Maximum number of movers fetching data from the HSM into the cache. If this number of movers is exceeded, the additional ones are queued.

st set max active <maxNumberOfStoreMovers>
<maxNumberOfStoreMovers> : Maximum number of movers storing data from the cache into the HSM(s). If this number of movers is exceeded, the additional ones are queued.

hsm set osm -pnfs=<pnfsMountPoint>
<pnfsMountPoint> : Points to the pnfs mountpoint.

hsm set osm -command=<d-cache-base>/jobs/<hsmCopyAppliction>
<hsmCopyAppliction> : Points to a script or binary which copies dCache files to/from the HSM.
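Putting the steps above together, a 'setup' file could look like the following sketch. All numbers are only examples, and the paths /pnfs/fs, /opt/d-cache and the copy script name hsmcp merely stand in for <pnfsMountPoint>, <d-cache-base> and <hsmCopyAppliction> of the local installation:

set max diskspace 495g
set heartbeat 120
mover set max active 10
rh set max active 4
st set max active 4
hsm set osm -pnfs=/pnfs/fs
hsm set osm -command=/opt/d-cache/jobs/hsmcp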
Make the entry in the domain - pool mapping file. Add the following line to the domain - pool file <d-cache-base>/config/<pool-map-file>.poollist (an example entry is sketched after the option list below) :

<poolName> <Pool-Base-Directory> <options>

Options are
- recover-space : instructs the pool, on startup, to recover from overbooking of disk space. If this option is given, the pool will remove files until the remaining datasets fit into the size specified by set max diskspace .... If this option is missing, the pool won't start if the space is overbooked. Even if this option is present, the pool will not start if the overbooking exceeds 10 percent.
- recover-control : instructs the pool, on startup, to recover from invalid control files. This may even result in removing the datafile if it has not been completely copied to disk. This option is honoured only if the pool can make sure that the file is already stored on tape. In cases where this is unclear or not determinable, the pool won't become ready.
- recover-anyway : This option is still experimental. It instructs the pool, on startup, to mark all files as bad for which the pool is not able to determine the proper status. These files are invisible to the dCache system but can be inspected by rep ls -l=b.
- sticky=allowed|denied : Instructs the pool to honour or ignore, respectively, the 'set sticky' message from control modules. The pool command rep set sticky on|off is not affected. The pool command set sticky allowed|denied overrides this option.
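For illustration, assuming a pool named pool1 with its base directory /pools/pool1 (both names made up) and space-separated options, the .poollist entry could read:

pool1 /pools/pool1 sticky=allowed recover-space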
Restart the pool domain :

cd <d-cache-base>/jobs
./pool -pool=<pool-map-file> stop
./pool -pool=<pool-map-file> start
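After the restart it can be worthwhile to check that the pool is up again, for example by logging into the pool cell and calling info (documented in the command reference below):

(local) admin > cd <poolName>
(<poolName>) admin > info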
Shutting down individual pools
It might become necessary to shut down individual pools while the rest of the system is still functional. The goal is not to interrupt any ongoing data transfer but to disallow new transfers from starting for that particular pool.
Therefore, that pool first has to be disabled.

#
# log into the pool
#
(local) admin > cd fooPool
(fooPool) admin > pool disable
From now on, the pool will no longer respond to any kind of storage request from the StorageManager. Therefore the different mover queues drain step by step. The pool command info provides an overview of the 3 queue types.

....... snip .......
Storage Queue : Classes  : 6
                Requests : 19
Mover Queue 8(15)/0
StorageHandler [diskCacheV111.pools.HsmStorageHandler2]
   Version : [$Id: PoolUsersGuide.html,v 1.1 2005/04/29 12:38:59 patrick Exp $]
   Sticky allowed : true
Job Queues
   to store   0(10)/0
   from store 0(4)/0
.... snap .........

The syntax is

<queue type> <active>(<maxAllowed>)/<waiting>

In the snapshot above, the mover queue (8 active out of a maximum of 15, 0 waiting) still has to drain, while the two HSM job queues are already empty. The pool shouldn't be stopped until all three queues are drained, which means no waiting or active jobs any more.
In case a whole pool domain has to be shut down, all pools within the domain have to be drained first. The queueInfo web page gives an insight into the queue status of all queues in one picture, which simplifies waiting for multiple queues to become ready for shutdown.

Special treatment of write pools
Theoretically the description above applies to write pools as well. But, depending on the settings of the HSM storage subsystem, it is very likely that, even if all queues are empty, there are still precious datasets in the pool repository, i.e. files not yet written to the HSM. This is not a problem at all: they will be flushed as soon as the pool is up again. In case the administrator needs to make sure that there are no remaining precious files on disk, these files have to be flushed manually.
#
# Get the list of storage queues :
#
(fooPool) admin > queue ls -l queue
#
# for each storage class (file family)
# enforce the store process by :
#
(fooPool) admin > flush class <hsmType> <storageClass>
#
# eg:
(fooPool) admin > flush class osm h1:rawData-2000
#

After all these store processes are finished, the write pool can be shut down.
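To double-check that nothing precious is left before stopping the pool, the storage queue listing from above can simply be repeated, and the Requests counter of the Storage Queue in the info output should have dropped to zero once everything has been flushed (a suggestion based on the commands already shown; the exact output depends on the installed version):

(fooPool) admin > queue ls -l queue
(fooPool) admin > info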
Pool Command Reference Manuals
mover remove <jobId>
mover ls [-binary [jobId] ]
mover kill <jobId>
mover set max active <maxActiveIoMovers>
rep ls [-l[=s,l,u,nc,p]] [-s[=kmgt]] | [<pnfsId> [...] ]
rep lock <pnfsId> [on | off | <time/sec>]
rep rm <pnfsid> [-force] # removes the pnfs file from the cache
rep set sticky <pnfsid> on|off
rep set bad <pnfsid> on|off
hsm remove <hsmName>
hsm ls [<hsmName>] ...
hsm unset <hsmName> [-<key>] ...
hsm set <hsmName> [-<key>=<value>] ...
sweeper ls [-l] [-s]
sweeper free <bytesToFree>
set max diskspace <space>[<unit>] # unit = k|m|g
set breakeven <breakEven> # free and recoverable space
set flushing interval # DEPRECATED (use flush set interval <time/sec>)
set cleaning interval <interval/sec>
set heartbeat <heartbeatInterval/sec>
set sticky allowed|denied
set gap <always removable gap>/bytes
pool enable # resume sending 'up' messages
pool disable [<errorCode> [<errorMessage>]] # suspend sending 'up' messages
rh jobs remove <jobId>
rh jobs ls
rh jobs kill <jobId>
rh ls [<pnfsId>]
rh restore <pnfsId>
rh set max active <maxActiveHsmMovers>
st jobs remove <jobId>
st jobs ls
st jobs kill <jobId>
st ls [<pnfsId>]
st set max active <maxActiveHsmMovers>
flush ls
flush pnfsid <pnfsid>
flush class <hsm> <storageClass>
flush set max active <maxActiveFlushes>
flush set retry delay <errorRetryDelay>/sec
flush set interval <flushing check interval/sec>
queue remove pnfsid <pnfsId> # !!!! DANGEROUS
queue remove class <hsm> <storageClass>
queue ls queue [-l]
queue ls classes [-l]
queue suspend class <hsm> <storageClass> | *
queue activate <pnfsId> # move pnfsid from <failed> to active
queue resume class <hsm> <storageClass> | *
queue deactivate <pnfsId> # move pnfsid from <active> to <failed>
queue define class <hsm> <storageClass> [-expire=<expirationTime/sec>] [-total=<maxTotalSize/bytes>] [-pending=<maxPending>]
exec context <arg-0>
pp remove <id>
pp ls # get the list of companions
pp keep on|off
pp get file <pnfsId> <pool>
pp set port <listenPort>
save # saves setup to disk
pnfs unregister # remove entry of all files from pnfs
pnfs register # add entry of all files into pnfs
info [-l|-a]
get cost [filesize] # get space and performance cost
show pinboard [<lines>] # dumps the last <lines> to the terminal
pf <pnfsId>
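As a small illustration of the repository commands above, the following session (with the pnfsId borrowed from the restore handler example earlier, purely as a placeholder) would list a single repository entry and mark it sticky:

(fooPool) admin > rep ls 0033000000000000006C3928
(fooPool) admin > rep set sticky 0033000000000000006C3928 on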