Batch Indexing into Offline Solr Shards
You can run the MapReduce job again, but this time without the GoLive feature. This causes the job to create a set of Solr index shards from a set of input files and write the indexes to HDFS. You can then explicitly point each Solr server to one of the HDFS output shard directories.
- Delete all existing documents in the collection3 collection and remove any previous output directory:
$ solrctl collection --deletedocs collection3
$ sudo -u hdfs hadoop fs -rm -r -skipTrash /user/$USER/outdir
- Run the Hadoop MapReduce job. Be sure to replace $NNHOST in the command with your NameNode hostname and port number:
$ hadoop --config /etc/hadoop/conf.cloudera.mapreduce1 jar \
/usr/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool -D \
'' --log4j \
/usr/share/doc/search*/examples/solr-nrt/ --morphline-file \
/usr/share/doc/search*/examples/solr-nrt/test-morphlines/tutorialReadAvroContainer.conf \
--output-dir hdfs://$NNHOST:8020/user/$USER/outdir --verbose --solr-home-dir \
$HOME/collection3 --shards 2 hdfs://$NNHOST:8020/user/$USER/indir
- Check the job tracker status. For example, on localhost, use http://localhost:50030/jobtracker.jsp.
- Once the job completes, check the generated index files. Individual shards are written to the results directory with names of the form part-00000, part-00001, part-00002, and so on. Because the job was run with --shards 2, there are only two shards here:
$ hadoop fs -ls /user/$USER/outdir/results
$ hadoop fs -ls /user/$USER/outdir/results/part-00000/data/index
- Stop Solr on each node of the cluster:
$ sudo service solr-server stop
- List the host name folders used as part of the path to each index in the SolrCloud cluster:
$ hadoop fs -ls /solr/collection3
- Move index shards into place.
- Remove the outdated index directories:
$ sudo -u solr hadoop fs -rm -r -skipTrash \
/solr/collection3/$HOSTNAME1/data/index
$ sudo -u solr hadoop fs -rm -r -skipTrash \
/solr/collection3/$HOSTNAME2/data/index
- Ensure correct ownership of the required directories:
$ sudo -u hdfs hadoop fs -chown -R solr /user/$USER/outdir/results
- Move the two index shards into place.
Note: You are moving the index shards to the two servers you set up in Preparing to Index Data.
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00000/data/index \
/solr/collection3/$HOSTNAME1/data/
$ sudo -u solr hadoop fs -mv /user/$USER/outdir/results/part-00001/data/index \
/solr/collection3/$HOSTNAME2/data/
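With more than two shards, typing each move by hand becomes error-prone. The following is a minimal dry-run sketch, not part of the product: it only prints the move command for each shard and host pair so you can review the commands before running them. The function name plan_shard_moves and the example host names are hypothetical. The sketch assumes that shard N maps to the Nth host you pass in and that shard directories follow the part-NNNNN naming described above.

```shell
#!/bin/sh
# Print (do not execute) one "hadoop fs -mv" command per shard/host pair.
# Usage: plan_shard_moves USER COLLECTION HOST...
# Assumption: the first host receives part-00000, the second part-00001, and so on.
plan_shard_moves() {
    user=$1
    collection=$2
    shift 2
    i=0
    for host in "$@"; do
        # Shard directories are zero-padded to five digits: part-00000, part-00001, ...
        part=$(printf 'part-%05d' "$i")
        echo "sudo -u solr hadoop fs -mv /user/$user/outdir/results/$part/data/index /solr/$collection/$host/data/"
        i=$((i + 1))
    done
}

# Example with placeholder values; substitute your own user and host name folders.
plan_shard_moves alice collection3 host1.example.com host2.example.com
```

Once the printed commands look right, you can run them one at a time or pipe the output to sh.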
- Start Solr on each node of the cluster:
$ sudo service solr-server start
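- Run some Solr queries. For example, send a match-all query to one of your Solr servers, using a query string that ends in *%3A*&wt=json&indent=true (*%3A* is the URL-encoded form of *:*, which matches every document).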
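As an illustration of such a query, the sketch below builds a match-all URL for a given server and collection. Several parts are assumptions rather than values taken from this page: the helper name solr_select_url is hypothetical, the host name is a placeholder, 8983 is Solr's default port and may differ in your deployment, and /select is Solr's standard search handler path.

```shell
#!/bin/sh
# Build a match-all query URL for one Solr server and collection.
# Assumptions: default Solr port 8983 and the standard /select handler.
# Usage: solr_select_url HOST COLLECTION
solr_select_url() {
    host=$1
    collection=$2
    # "*:*" matches every document; the colon is URL-encoded as %3A.
    # In a printf format string, "%%" emits a literal percent sign.
    printf 'http://%s:8983/solr/%s/select?q=*%%3A*&wt=json&indent=true\n' "$host" "$collection"
}

# Example with a placeholder host name.
solr_select_url solr1.example.com collection3
```

You could open the printed URL in a browser or fetch it with a command-line HTTP client. If the shards were moved into place correctly, the response should list documents from the newly built index.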
Page generated September 3, 2015.