県内図書書像マップの関連図書検索のロジックを変えてみた

WebcatPlusにISBNを喰わせて連想検索を行い、それぞれのISBNと書名をJSONで返す部分を書き直した。
以前、Web::Scraperでスプレイプしようとした時に図書情報のページの構造がtableのネストが多く、また書名やISBNの位置がかなり不定なのでちょっとムリかなと思っていたが、久しぶりにWebcat Plusが吐くHTMLを眺めたら必要な文字列にすべて同一のclass名が付いていることを発見
大雑把にこんな感じにコードを書くと

my $uri = URI->new($ISBN_SEARCH_URL_PREFIX . $isbn . $ISBN_SEARCH_URL_SUFFIX);
print Dump scraper {
    process '//a[@href=~/DocDetail/]',
    'values' => ['@href', \&get_detail];
}->scrape($uri);

sub get_detail{
    my $associative = scraper {
            process '//font[@class="fs100"]',
                'values[]' => 'TEXT';
            result 'values';
        }->scrape($_);
}

まるっといい感じで取れる

---
values:
  - ' フィリップ K.ディック著 ; 友枝康子訳 -- 早川書房, 1989.2, 375p. -- (ハヤカワ文庫 ; SF-807) '
  - '<BA33052047>'
  - ' 所蔵図書館　11館'
    〜　略　〜
  - タイトル
  - ' 流れよわが涙、と警官は言った [ナガレヨ ワガ ナミダ ト ケイカン ワ イッタ] '
  - ' (ハヤカワ文庫 ; SF-807) '
  - 責任表示
  - ' フィリップ K.ディック著 ; 友枝康子訳 '
  - 資料種別
  - ' '
    〜　略　〜
  - 形態事項
  - ' 375p ; 16cm '
  - ISBN
  - ' 4150108072 '
  - 内容著作注記
  - ' '
    〜　略　〜

必ず"タイトル"の後に書名、"ISBN"のあとにISBN(付番されてないときはないが)が存在
頭から配列舐めていって"タイトル"と"ISBN"の次のデータを引っ張るればもってこれる
フラグ作って読み飛ばしとか格好わるいしなと思ったが、List::UtilかList::MoreUtilsに何かあるはず。

first_index BLOCK LIST
Returns the index of the first element in LIST for which the criterion in BLOCK is true. Sets $_ for each item in LIST in turn:

my @list = (1, 4, 3, 2, 4, 6);
printf "item with index %i in list is 4", firstidx { $_ == 4 } @list;
__END__
item with index 1 in list is 4

Returns -1 if no such item could be found.
first_index is an alias for firstidx.
http://search.cpan.org/~vparseval/List-MoreUtils-0.22/lib/List/MoreUtils.pm

で、最終的にこう書いた。多分まだまだ直す余地あり

#!/usr/bin/perl
use FindBin::libs;
use strict;
use warnings;
use Business::ISBN;
use CGI;
use Data::Alias;
use Encode;
use JSON;
use List::MoreUtils qw( first_index );
use Readonly;
use URI;
use Web::Scraper;

Readonly my $ISBN_SEARCH_URL_PREFIX =>
   'http://webcatplus-equal.nii.ac.jp/libportal/EqualFromForm?'
 . 'hdn_from=top&txt_title=&radio_partialMatch=1&txt_author=&'
 . 'authorPartialMatch=0&txt_publisher=&txt_year1=&txt_year2=&'
 . 'txt_isbn='
 ;
Readonly my $ISBN_SEARCH_URL_SUFFIX =>
   '&txt_keyword=&check_book=on&check_magazine=on&'
 . 'select_sorttype=0&select_dmax=10&x=0&y=0'
 ;

#  Associative Search 
Readonly my $ASSOCIATIVE_SEARCH_URL_PREFIX =>
   'http://webcatplus.nii.ac.jp/assoc.cgi?hdn_mode=equal_assoc&'
 . 'select_dmax=20&check_dsel=1%2C'
;
Readonly my $BLANK => q{};

my $json  = new JSON;
$json->pretty;
my $details;
my $obj = {
    flg    => JSON::false,
};

my $q    = new CGI;
my $isbn = $q->param('isbn');

#   ISBNから連想検索用のNCID等を取得
my $uri = URI->new($ISBN_SEARCH_URL_PREFIX . $isbn . $ISBN_SEARCH_URL_SUFFIX);
my $uris = scraper {
                process '//a[@href=~/DocDetail/]',
                    'values' => ['@href', \&get_associative];
                result 'values';
}->scrape($uri);

#   連想検索の結果からそれぞれの詳細ページ
for my $url (@{$uris}) {
    $details
        = scraper {
            process '//font[@class="fs100"]',
                'values[]' => ['TEXT', \&trim];
            result 'values';
          }->scrape($url);

    set_title();
}

#   テーブルが一件でもあればTrue
if ($obj->{series}) {
    $obj->{flg} = JSON::true;
}

print $q->header(-type => "application/x-javascript; charset=utf-8");
print $json->canonical->encode($obj);

sub get_associative{
#   uriの=以降がNCID
    my (undef, $ncid) = split /=/, $_->as_string;
    my $uri = URI->new($ASSOCIATIVE_SEARCH_URL_PREFIX . $ncid);
    my $associative = scraper {
            process '//a[@href=~/DocDetail/]',
                'values[]' => '@href';
            result 'values';
        }->scrape($uri);
}

sub trim {
    my $val = $_;
       $val =~ s{\A \s* | \s* \z}{}gxm;
       $val = encode('UTF-8', $val);
    return $val;
}

sub set_title {
    my $ix   = first_index{ $_ eq 'ISBN' } @{ $details };
    return if @{ $details }[$ix + 1] =~ /^[^\d{9]/;
    my $temp = @{ $details }[ $ix + 1 ];
       $temp =~ s{ \s | [)] }{}gxm;
       $temp =~ s{ [(]      }{>}gxm;
    my @tbls = split /[,]/, $temp;
#   タイトルを取得(読みは除く)
       $ix = first_index{ $_ eq 'タイトル' } @{ $details };
    my ($title, undef) = split /\s\[/, @{ $details }[ $ix + 1 ];

    for my $tbl (@tbls) {
        my ($cd, $sub_ttl) = split /[>]/, $tbl;
        next unless $cd;
#   常にISBN-10に変換(ASINとして用いるため)
        my $isbn = Business::ISBN->new($cd)->as_isbn10->isbn;
        if ($sub_ttl) {
            ${ $obj->{series} }{ $isbn } = $title . ' ' . $sub_ttl;
        }
        else {
            ${ $obj->{series} }{ $isbn } = $title;
        }
    }

}

多分あと二回ぐらいゼロから書き直すと結構マシになりそう