I have just released version 1.0.0 of hfind on GitHub.

hfind is a find utility for Hadoop. It implements most of the POSIX specification, so you should be able to use it pretty much as you use find(1) (although most of the superset primaries added by the GNU or BSD versions are not yet implemented).

Here is, for instance, a Bash snippet I use to clean up some of my directories on HDFS (it expects $HFIND and $HADOOP to point to the hfind and hadoop executables):

#!/bin/bash -e

LOG=$HOME/log/hadoop-cleanup.$(date "+%Y%m%d")

cleanup_directory() {
    local dir=$1
    local cutoff=$2

    # Select entries last modified more than $cutoff days ago (-mtime +N), so recent data is kept
    $HFIND $dir -type f -mtime +$cutoff | xargs $HADOOP fs -rm >> $LOG
    $HFIND $dir -mtime +$cutoff -empty | xargs $HADOOP fs -rmr >> $LOG
}

# Keep the last 7 days of test data
cleanup_directory /user/pierre/test 7

# Keep the last 6 months of rollups
cleanup_directory /user/pierre/rollups 186
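
The cutoff argument relies on find's day-based age tests. In POSIX find(1), -mtime +n matches files last modified more than n days ago, and -mtime -n matches files modified less than n days ago. The sketch below illustrates this with the standard find on a local scratch directory. It assumes, but does not verify, that hfind follows the same convention on HDFS. The file names are made up for the demo, and `touch -d "10 days ago"` is GNU touch syntax.

```shell
# Demo with the standard find(1), not hfind; file names are hypothetical.
tmp=$(mktemp -d)
touch "$tmp/new.dat"
touch -d "10 days ago" "$tmp/old.dat"   # GNU touch: backdate the modification time

# +7: last modified more than 7 days ago (what a cleanup should remove)
older=$(cd "$tmp" && find . -type f -mtime +7)
# -7: last modified less than 7 days ago (what a cleanup should keep)
newer=$(cd "$tmp" && find . -type f -mtime -7)

echo "older than 7 days: $older"   # ./old.dat
echo "newer than 7 days: $newer"   # ./new.dat
rm -rf "$tmp"
```

Running a cleanup command with echo in place of the actual removal is a cheap way to check which side of the cutoff it selects.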

The following primaries, some originally requested in HDFS-227, have been implemented:

  • -type [f|d]
  • -atime and -mtime, which support both + and - arguments
  • -depth n and -d
  • -owner/-group/-nouser/-nogroup
  • -name, which supports globbing and regex
  • -size
  • -empty

The following ones have been scheduled for the next release:

  • -print0 (for piping to xargs -0)
  • -perm
  • -delete
  • parens support

On the last point: -a and -o are already implemented, so you can write

find / -nouser -o -nogroup -a -name test.dat

but not yet

find / \( -nouser -o -nogroup \) -a -name test.dat
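
The two forms are not equivalent. In POSIX find(1), -a binds tighter than -o, so the first expression is read as -nouser -o ( -nogroup -a -name test.dat ). The sketch below shows the difference with the standard find on a local scratch directory, using -name tests in place of -nouser and -nogroup because files without an owner are awkward to create for a demo. The file names are hypothetical, and hfind is assumed, not verified, to apply the same precedence.

```shell
# Demo with the standard find(1), not hfind; file names are hypothetical.
tmp=$(mktemp -d)
touch "$tmp/other.log" "$tmp/other.dat" "$tmp/test.log" "$tmp/test.dat"

# No parentheses: parsed as  '*.log' -o ( '*.dat' -a 'test.*' )
ungrouped=$(cd "$tmp" && find . -name '*.log' -o -name '*.dat' -a -name 'test.*' | LC_ALL=C sort)

# Explicit grouping: ( '*.log' -o '*.dat' ) -a 'test.*'
grouped=$(cd "$tmp" && find . \( -name '*.log' -o -name '*.dat' \) -a -name 'test.*' | LC_ALL=C sort)

echo "ungrouped:"; echo "$ungrouped"   # ./other.log ./test.dat ./test.log
echo "grouped:";   echo "$grouped"     # ./test.dat ./test.log
rm -rf "$tmp"
```

Until parentheses are supported, an expression that needs grouping has to be rewritten so that the default precedence gives the intended result.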

Note that -delete is not implemented yet: I first want more feedback and testing. It should be available in 1.0.1.

To get started on hfind, read the main page at http://github.com/pierre/hfind.

The direct download link for the 1.0.0 release is http://github.com/downloads/pierre/hfind/metrics.hfind-1.0.0.tar.gz.

Happy finding!

2 comments so far

  1. As long as you are only cleaning your own files it may work. But if your file names are generated by users, you need to deal with surprising file names containing spaces, ', or ".

    xargs can lead to nasty surprises caused by the separator problem.

    GNU Parallel http://nd.gd/0s may be better.

    • Yes, the xargs example is definitely suboptimal.

      hfind now supports -print0 (see this commit).
      -delete is implemented as well (see this commit), which should be safer and faster than piping to hadoop fs -rm.

      Both options are available in the master branch (git://github.com/pierre/hfind.git). 1.0.1 should hopefully be out soon.
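
The separator problem discussed in this thread can be reproduced locally. The sketch below uses the standard find and xargs, not hfind, on a scratch directory with hypothetical file names. By default xargs splits its input on whitespace, so a name containing a space becomes two arguments. With -print0 and xargs -0 (GNU and BSD extensions), names are separated by NUL bytes and arrive intact.

```shell
# Demo with the standard find(1) and xargs(1), not hfind; file names are hypothetical.
tmp=$(mktemp -d)
touch "$tmp/plain.dat" "$tmp/with space.dat"

# Whitespace-separated input: "./with space.dat" is split into two arguments.
naive=$(cd "$tmp" && find . -type f -name '*.dat' | xargs sh -c 'echo $#' sh)

# NUL-separated input: each file name stays a single argument.
safe=$(cd "$tmp" && find . -type f -name '*.dat' -print0 | xargs -0 sh -c 'echo $#' sh)

echo "arguments seen without -print0: $naive"   # 3
echo "arguments seen with -print0:    $safe"    # 2
rm -rf "$tmp"
```

In a cleanup script, the naive form would ask for the removal of two paths that do not exist and leave the real file in place.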