The dCache Pool (MultiProtocolPool2)

Scope of this document

This document describes the tasks and the assumed behavior of the dCache MultiProtocolPool (v2). Moreover, it briefly describes how to configure and manage a pool from the administrator's point of view.

Task Description

The MultiProtocolPool (v2) essentially manages the disk repository of a pool node: it serves files to and from clients through its movers, restores files from the attached HSM(s) on request, and flushes precious files back to the HSM(s).

Restore Handler Commands

The 'restore handler' is a subsystem of the PoolManager cell. It allows the administrator to view, retry and cancel ongoing restore processes.
   (local) admin > cd PoolManager
   (PoolManager) admin >
   

rc ls
Displays a list of all restore processes. Format :
   0033000000000000006C3928@131.169.0.0/255.255.0.0 m=1 [0]  [zeus-doener-14] [Staging 08.25 22:52:41] {0,}
   
   <pnfsId>@<netMask> m=<# of clients> [<retries>] [<pool>] [<Status>] {<errorCode>:<ErrorMessage>}
   

In case a request has a non-zero error status, the rc commands can be used to cancel the request or to retry it.

rc retry <pnfsId>@<netMask>

The specified request will be retried. Retrying a restore request only makes sense if the reason for the failure has been resolved. In most cases the failure is caused by the HSM. Before the request goes into an error status, it is retried automatically on different pools (if available).
rc failed <pnfsId>@<netMask> [<errorCode> [<errorMessage>]]
rc failed cancels the request and (if specified) sends the error message and the error code to the requesting client.
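
As a sketch, a failed restore request could be handled as follows. The pnfsId is taken from the example above; the error code and message are purely illustrative:
    (PoolManager) admin > rc ls
    #
    # either retry, after the HSM problem has been fixed :
    #
    (PoolManager) admin > rc retry 0033000000000000006C3928@131.169.0.0/255.255.0.0
    #
    # or give up and report the error back to the waiting clients :
    #
    (PoolManager) admin > rc failed 0033000000000000006C3928@131.169.0.0/255.255.0.0 2 "HSM unavailable"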

Setting Up Pools

Create the 'control' and 'data' directories :
cd <Pool-Base-Directory>
mkdir control data
Set up the 'setup' file: create it in the <Pool-Base-Directory> and adjust the variables below according to your needs.

set max diskspace <PoolSize>

<PoolSize> : Defines the size of this pool. The suffixes k, m and g can be used to specify 1024, 1024*1024 and 1024*1024*1024 bytes respectively. As a rule of thumb :
   set max diskspace  (<Kbytes from 'df -k'> / 1024 / 1024 - 5 )g
   
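For example, assuming 'df -k' reports 524288000 Kbytes for the pool partition (a hypothetical value), the rule of thumb gives:
    524288000 / 1024 / 1024 = 500 ,  500 - 5 = 495
    set max diskspace 495g
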

set heartbeat 120

Sets the interval of the heartbeat timer in seconds.

mover set max active <maxNumberOfClientMovers>

<maxNumberOfClientMovers> : Maximum number of movers talking to clients. If this number of movers is exceeded, the additional ones are queued.

rh set max active <maxNumberOfRestoreMovers>

<maxNumberOfRestoreMovers> : Maximum number of movers fetching data from HSM into the cache. If this number of movers is exceeded, the additional ones are queued.

st set max active <maxNumberOfStoreMovers>

<maxNumberOfStoreMovers> : Maximum number of movers storing data from the cache into the HSM(s). If this number of movers is exceeded, the additional ones are queued.

hsm set osm -pnfs=<pnfsMountPoint>

<pnfsMountPoint> : Points to the pnfs mountpoint.

hsm set osm -command=<d-cache-base>/jobs/<hsmCopyApplication>

<hsmCopyApplication> : Points to a script or binary which copies dCache files to/from the HSM.
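Putting it together, a complete 'setup' file might look like the following sketch; all values, the pnfs mount point and the copy script name are examples only and have to be adapted to the local installation:

    set max diskspace 495g
    set heartbeat 120
    mover set max active 100
    rh set max active 4
    st set max active 10
    hsm set osm -pnfs=/pnfs/fs
    hsm set osm -command=/opt/d-cache/jobs/hsmcp
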
Make the entry in the domain - pool mapping file : add the following line to <d-cache-base>/config/<pool-map-file>.poollist

  <poolName> <Pool-Base-Directory> <options>
Options are
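A purely illustrative entry for a pool named 'pool1', with its base directory /pools/pool1 and no options, would read:

  pool1 /pools/pool1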
Restart the pool domain :
   cd <d-cache-base>/jobs
   ./pool -pool=<pool-map-file> stop
   ./pool -pool=<pool-map-file> start

Shutting down individual pools

It might become necessary to shut down individual pools while the rest of the system is still functional. The goal is not to interrupt any ongoing data transfer but to disallow new transfers to start for that particular pool.
Therefore the pool first has to be disabled.
    #
    # log into the pool
    #
    (local) admin > cd fooPool
    (fooPool) admin > pool disable
  

From now on, the pool no longer replies to any kind of storage request from the PoolManager. Therefore the different mover queues drain step by step. The pool command 'info' provides an overview of the three queue types.
    ...... snip .......
    
    
      Storage Queue     : 
         Classes  : 6
         Requests : 19
      Mover Queue 8(15)/0
      StorageHandler [diskCacheV111.pools.HsmStorageHandler2]
        Version        : [...]
       Sticky allowed  : true
        Job Queues 
          to store   0(10)/0
          from store 0(4)/0


     .... snap .........
  
The syntax is
  <queue type> <active>(<maxAllowed>)/<waiting>
  
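In the example output above, 'Mover Queue 8(15)/0' therefore means 8 client movers active, a maximum of 15 allowed and none waiting, while 'to store 0(10)/0' and 'from store 0(4)/0' are already drained.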
The pool shouldn't be stopped until all three queues are drained, which means there are no active or waiting jobs any more.
In case a whole pool domain has to be shut down, all pools within the domain have to be drained first. The queueInfo web page gives an overview of the status of all queues in one picture, which simplifies waiting for multiple queues to become ready for shutdown.

Special treatment of write pools

Theoretically the description above applies to write pools as well. But, depending on the settings of the HSM storage subsystem, it is very likely that, even if all queues are empty, there are still precious datasets in the pool repository, i.e. files not yet written to the HSM. This is not a problem at all: they will be flushed as soon as the pool is up again. In case the administrator needs to make sure that there are no remaining precious files on disk, these files have to be flushed manually.

    #
    # Get the list of storage queues :
    #
    (fooPool) admin > queue ls -l queue
    #
    # for each storage class (file family)
    # enforce the store process by :
    #
    (fooPool) admin > flush class <hsmType> <storageClass>
    #
    #  eg: (fooPool) admin > flush class osm h1:rawData-2000
    #
  
After all these store processes are finished, the write pool can be shut down.


Pool Command Reference

 mover remove <jobId>
 mover ls [-binary [jobId] ]
 mover kill <jobId>
 mover set max active <maxActiveIoMovers>
 rep ls [-l[=s,l,u,nc,p]] [-s[=kmgt]] | [<pnfsId> [...] ]
 rep lock <pnfsId> [on | off | <time/sec>]
 rep rm <pnfsid> [-force] # removes the pnfs file from the cache
 rep set sticky <pnfsid> on|off
 rep set bad <pnfsid> on|off
 hsm remove <hsmName>
 hsm ls [<hsmName>] ...
 hsm unset <hsmName> [-<key>] ... 
 hsm set <hsmName> [-<key>=<value>] ... 
 sweeper ls  [-l] [-s]
 sweeper free <bytesToFree>
 set max diskspace <space>[<unit>] # unit = k|m|g
 set breakeven <breakEven> # free and recoverable space
 set flushing interval DEPRECATED (use flush set interval <time/sec>)
 set cleaning interval <interval/sec>
 set heartbeat <heartbeatInterval/sec>
 set sticky allowed|denied
 set gap <always removable gap>/bytes
 pool enable  # resume sending 'up messages'
 pool disable [<errorCode> [<errorMessage>]] # suspend sending 'up messages'
 rh jobs remove <jobId>
 rh jobs ls 
 rh jobs kill <jobId>
 rh ls [<pnfsId>]
 rh restore <pnfsId>
 rh set max active <maxActiveHsmMovers>
 st jobs remove <jobId>
 st jobs ls 
 st jobs kill <jobId>
 st ls [<pnfsId>]
 st set max active <maxActiveHsmMovers>
 flush ls 
 flush pnfsid <pnfsid>
 flush class <hsm> <storageClass>
 flush set max active <maxActiveFlushes>
 flush set retry delay <errorRetryDelay>/sec
 flush set interval <flushing check interval/sec>
 queue remove pnfsid <pnfsId> # !!!! DANGEROUS
 queue remove class <hsm> <storageClass>
 queue ls queue  [-l]
 queue ls classes  [-l]
 queue suspend class <hsm> <storageClass> | *
 queue activate <pnfsId>  # move pnfsid from <failed> to active
 queue resume class <hsm> <storageClass> | *
 queue deactivate <pnfsId>  # move pnfsid from <active> to <failed>
 queue define class <hsm> <storageClass> [-expire=<expirationTime/sec>] [-total=<maxTotalSize/bytes>] [-pending=<maxPending>] 
 exec context  <arg-0>
 pp remove <id>
 pp ls  # get the list of companions
 pp keep on|off
 pp get file <pnfsId> <pool>
 pp set port <listenPort>
 save  # saves setup to disk
 pnfs unregister  # remove entry of all files from pnfs
 pnfs register  # add entry of all files into pnfs
 info [-l|-a]
 get cost  [filesize] # get space and performance cost
 show pinboard [<lines>] # dumps the last <lines> to the terminal
 pf <pnfsId>
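
As an illustrative sketch (the pnfsId is hypothetical), the repository entry of a single file can be listed and its sticky flag cleared with the commands above:
    (fooPool) admin > rep ls 000100000000000000001120
    (fooPool) admin > rep set sticky 000100000000000000001120 off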