The following question was posted to the Hadoop mailing list, and this is Doug Cutting's answer.

Are there any problems with storing Lucene index files in Hadoop?

Creating Lucene indexes directly in DFS would be pretty slow. Nutch
creates them locally, then copies them to DFS to avoid this.

One could create a Lucene Directory implementation optimized for
updates, where new files are written locally, and only flushed to DFS
when the Directory is closed. When updating, Lucene creates and reads
lots of files that might not last very long, so there's little point in
replicating them on the network. For many applications, that should be
considerably faster than either updating indexes directly in HDFS, or
copying the entire index locally, modifying it, then copying it back.

Lucene search works from HDFS-resident indexes, but is slow, especially
if the indexes were created on a different node than that searching
them. (HDFS tries to write one replica of each block locally on the
node where it is created.)

- Doug Cutting
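
The "write locally, flush on close" idea in the reply above can be sketched roughly as follows. This is only an illustration, not Lucene's or Hadoop's API: the class `LocalBufferedDirectory` and its methods are hypothetical names, and an ordinary filesystem path stands in for the DFS location. In a real setup the copy step would presumably go through the Hadoop client (for example `hadoop fs -put`) instead of `shutil`.

```python
import shutil
import tempfile
from pathlib import Path


class LocalBufferedDirectory:
    """Hypothetical stand-in for a Lucene Directory that buffers writes locally."""

    def __init__(self, remote_dir):
        # remote_dir is an ordinary path here; it stands in for a DFS location.
        self.remote_dir = Path(remote_dir)
        self.local_dir = Path(tempfile.mkdtemp(prefix="index-"))
        self.closed = False

    def write_file(self, name, data):
        # New files land on local disk only; nothing is replicated yet.
        (self.local_dir / name).write_bytes(data)

    def read_file(self, name):
        # Reads during an update are served from the local copy.
        return (self.local_dir / name).read_bytes()

    def delete_file(self, name):
        # Short-lived files disappear without ever touching remote storage.
        (self.local_dir / name).unlink()

    def close(self):
        # Flush only the files that survived until close, then clean up.
        if self.closed:
            return
        self.remote_dir.mkdir(parents=True, exist_ok=True)
        for path in self.local_dir.iterdir():
            shutil.copy2(path, self.remote_dir / path.name)
        shutil.rmtree(self.local_dir)
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False
```

Used as `with LocalBufferedDirectory(target) as d: ...`, temporary files that are written and deleted inside the block never reach `target`; only the files still present when the block exits are copied.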

--------------------------------------------------------

Dennis Kubes
I should have been more specific. Create the indexes using mapreduce,
then store on the dfs using the indexer job. To have clusters of
servers answer a single query we have found a best practice to be
splitting the index and associated databases into smaller pieces and
having those pieces on local file system that are fronted by distributed
search servers. Then have a search website that uses the search servers
to answer the query. An example of this setup can be found on the
NutchHadoopTutorial on the Nutch wiki.
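
The split-index setup described above amounts to a scatter-gather query: each search server answers from its own local piece, and the front end merges the partial results. The sketch below only illustrates that pattern under simplifying assumptions; it is not Nutch's implementation. The functions `search_shard` and `distributed_search` are hypothetical, each shard is an in-memory dict instead of a Lucene index on a local disk, and the score is a naive term count.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Return up to k (score, doc_id) hits from one shard, best first."""
    # A shard here is a dict of doc_id -> text; score is a naive term count.
    term = query.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=10):
    """Query every shard in parallel and merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)
```

Because each shard returns only its own top `k`, the front end merges at most `k` hits per shard, however large the individual pieces are.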
Creative Commons License

Posted by 김형준



Comments List

  1. typos 2006/11/23 11:30

    Does this mean we shouldn't use Hadoop when building a BigTable? Hmm..
