Saturday, November 23, 2013

Data import options with Elasticsearch

Architecturally, there are two approaches to loading data into Elasticsearch (ES). At the outset you will have to decide between a "push" and a "pull" model based on your requirements and performance goals. In this article we will explore the ES data load options in both categories.

I have sourced much of this information from the ES mailing list. In fact, this is a compilation of everything I found there while researching the topic, since I could not find any tutorial or article that covers it comprehensively.

Before we jump into the topic, there are a few basic things to remember about indexing data in ES. As a rule of thumb, load performance improves with more shards and query performance improves with more replicas, so you need to find a sweet spot for your setup. In ES all indexing goes through the primary shards. Follow an iterative approach: don't start with tuning. Instead, let tuning decisions follow from what you learn about your own setup. Also remember that it takes significantly more time to index into an existing index than into an empty one.

Pull Model


River plugin 

Rivers are built as custom plugin code that is deployed to ES and runs inside an ES node. They are a good fit when you expect a constant flow of data that needs to be indexed and you don't want to write another external application to push data into ES. A very good use case is indexing analytics data and server logs, or data coming out of a NoSQL store like Cassandra or MongoDB.

River plugins can also import using the Bulk API. This is useful when the river can accumulate data up to a certain threshold before indexing it. Since the client runs within the ES node, it is cluster aware.
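
The "accumulate up to a threshold, then index" idea can be sketched roughly as follows. This is not river plugin code and uses no ES classes: BatchBuffer and its sink callback are hypothetical names chosen for illustration, and in a real river the sink would be whatever issues the bulk request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects documents and hands them to a sink once a size threshold is reached. */
final class BatchBuffer {
    private final int threshold;
    private final Consumer<List<String>> sink; // hypothetical stand-in for "send one bulk request"
    private List<String> pending = new ArrayList<>();

    BatchBuffer(int threshold, Consumer<List<String>> sink) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
        this.sink = sink;
    }

    /** Adds one document and flushes automatically when the batch is full. */
    synchronized void add(String jsonDocument) {
        pending.add(jsonDocument);
        if (pending.size() >= threshold) {
            flush();
        }
    }

    /** Sends whatever is pending; call once more at shutdown for the last partial batch. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }

    public static void main(String[] args) {
        BatchBuffer buffer = new BatchBuffer(2, batch -> System.out.println("bulk of " + batch.size()));
        buffer.add("{\"n\":1}");
        buffer.add("{\"n\":2}"); // threshold reached, first batch is sent
        buffer.add("{\"n\":3}");
        buffer.flush();          // sends the remaining document
    }
}
```

A size threshold alone can leave documents waiting for a long time when the flow slows down, so a time-based flush is usually worth adding alongside it.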

Push Model

curl -XPUT
This is perhaps the simplest way to index a document: you just perform a PUT on a REST endpoint. It works best during the development phase, when you want to index a few documents from the command line for quick validation. In the example below, the type name "product" and the document ID 1 are placeholders.
curl -XPUT 'http://127.0.0.1:9200/test/product/1' -d '{"partnumber":"HLG028_281201","name":"Modern Houseware Hanging Lamp","shortdescription":"A red hanging lamp in a triangular shape.","longdescription":"A hanging lamp with red ambient shades to add a romantic mood to your room. Perfect for your bedroom or your children'\''s room. Easy set up so you do not have to pay electricians to set it up."}'

UDP Bulk API
UDP is a connectionless datagram protocol. This option is faster but less reliable, since you get no acknowledgement of success or failure.
E.g. cat bulk.txt | nc -w 0 -u localhost 9700

HTTP Bulk API
This is a good fit if you have an external application that consolidates the data periodically and then formats it as JSON to be indexed. It is much more reliable than UDP bulk import, because you get an acknowledgement of each index operation and can take corrective steps based on the response.
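
To make the format concrete, here is a minimal sketch of how such an external application might assemble a bulk body: one action line followed by one source line per document, each ending in a newline. BulkBody is a hypothetical helper, not part of any ES library, and it assumes the index, type and ID values need no JSON escaping.

```java
/** Builds the newline-delimited body expected by the HTTP _bulk endpoint. */
final class BulkBody {
    private final StringBuilder body = new StringBuilder();

    /** Adds one index action; source must be a JSON document on a single line. */
    BulkBody index(String index, String type, String id, String source) {
        if (source.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("source must not contain newlines");
        }
        // Action line: tells ES where the following document goes.
        body.append("{\"index\":{\"_index\":\"").append(index)
            .append("\",\"_type\":\"").append(type)
            .append("\",\"_id\":\"").append(id).append("\"}}\n");
        // Source line: the document itself, terminated by a newline.
        body.append(source).append('\n');
        return this;
    }

    String build() {
        return body.toString();
    }

    public static void main(String[] args) {
        String payload = new BulkBody()
            .index("test", "product", "1", "{\"name\":\"Hanging Lamp\"}")
            .index("test", "product", "2", "{\"name\":\"Desk Lamp\"}")
            .build();
        System.out.print(payload);
    }
}
```

The output can be saved to a file such as bulk.txt and posted with curl -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.txt. The response lists the outcome of each item, and this is what makes corrective steps possible.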

Java TransportClient bulk indexing 

The Java TransportClient can be used within a custom ETL load that runs outside the ES nodes: you connect to an ES node from a remote host and can index with multiple threads. It saves a bit of HTTP overhead by using the native ES protocol. Bulk is always best, as it groups the requests per shard and minimizes network round trips. The TransportClient is thread safe and built to be reused by several threads, so when writing bulk load code make sure you do not create a TransportClient in a loop. Instead, send all requests through one TransportClient instance per JVM, perhaps by creating it as a singleton.

Internally, the TransportClient sends each request asynchronously.
Another nice thing about the TransportClient is that it automatically round-robins requests across the ES nodes it is connected to. The receiving node then splits each bulk request into the respective "shard bulks".

Here is a sample snippet that can be used for connecting to the ES cluster.

   // esHosts and esClusterName are assumed to be defined by the caller.
   // Note: the discovery.* and index.* entries are node-level and index-level settings.
   // To be sure the index.* values take effect, also apply them to the index itself
   // (see Performance Tuning below).
   ImmutableSettings.Builder clientSettings = ImmutableSettings.settingsBuilder()
              .put("http.enabled", "false")
              .put("discovery.zen.minimum_master_nodes", 1)
              .put("discovery.zen.ping.multicast.ttl", 4)
              .put("discovery.zen.ping_timeout", 100)
              .put("discovery.zen.fd.ping_timeout", 300)
              .put("discovery.zen.fd.ping_interval", 5)
              .put("discovery.zen.fd.ping_retries", 5)
              .put("client.transport.ping_timeout", "10s")
              .put("multicast.enabled", false)
              .put("discovery.zen.ping.unicast.hosts", esHosts)
              .put("cluster.name", esClusterName)
              .put("index.refresh_interval", "10s") //change refresh interval to a higher value (a bare "10" would mean 10 ms)
              .put("index.merge.async", true); //change index merge to async


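Here is sample code for creating the ES client and using the bulk API for indexing. It assumes an SLF4J-style logger named log and an enclosing method that handles IOException.

import java.util.LinkedList;
import java.util.List;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequestBuilder;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.common.xcontent.XContentFactory;

// Create the client once and reuse it for every request
TransportClient client = new TransportClient(clientSettings.build());
List<TransportAddress> addresses = new LinkedList<TransportAddress>();
// Add one or more ES addresses and transport ports
InetSocketTransportAddress address = new InetSocketTransportAddress("<ES_IP>", Integer.parseInt("<ES_PORT>"));
addresses.add(address);
TransportAddress[] taddresses = addresses.toArray(new TransportAddress[addresses.size()]);
client.addTransportAddresses(taddresses);

// Create the initial bulk request builder
BulkRequestBuilder bulkRequest = client.prepareBulk();
bulkRequest.setRefresh(false);

// Build the JSON content using XContentBuilder (the fields here are examples)
XContentBuilder source = XContentFactory.jsonBuilder()
    .startObject()
    .field("partnumber", "HLG028_281201")
    .field("name", "Modern Houseware Hanging Lamp")
    .endObject();
IndexRequestBuilder indexRequestBuilder = client.prepareIndex("<ES_INDEX_NAME>", "regular");
indexRequestBuilder.setSource(source);
bulkRequest.add(indexRequestBuilder);

// Execute the bulk request and check the per-item results
BulkResponse bulkResponse = bulkRequest.execute().actionGet();
if (bulkResponse.hasFailures()) {
    log.error("Failed to send all requests in bulk: " + bulkResponse.buildFailureMessage());
} else {
    log.info("Elasticsearch index updated in {} ms.", bulkResponse.getTookInMillis());
}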


Performance Tuning

1. Start by tuning the index refresh rate for the duration of the bulk load. While importing a large amount of data, it is recommended to disable refresh by setting the refresh interval to -1. You can then refresh the index programmatically at the end of the load.

You can define the index refresh rate globally in config/elasticsearch.yml or per index. A value of -1 disables refresh. Otherwise, set it to a positive time value (for example 30s) that matches your refresh requirements.
curl -XPUT localhost:9200/test/_settings -d '{
    "index" : {
        "refresh_interval" : "-1"
    } }'

2. Tune the bulk thread pool size carefully. Under most circumstances the defaults are good enough, but you can increase or decrease them based on your application's requirements. For instance, if you expect data to flow into the index all the time, you can consider adding more threads to the bulk pool.
Always remember this rule of thumb: every thread eats up system resources, so try to match the pool size to the number of cores.

# Search pool
threadpool.search.type: fixed
threadpool.search.size: 3
threadpool.search.queue_size: 100

# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 2
threadpool.bulk.queue_size: 300

# Index pool
threadpool.index.type: fixed
threadpool.index.size: 2
threadpool.index.queue_size: 100

3. If you want both maximum load performance and maximum search performance, use two indexes, one for the old generation and one for the new generation, and connect them with an index alias. Distribute the indexes over the nodes so that they form two separate groups on different machines (for example, by moving shards or using shard allocation settings). Set the replica level to 0 (no replicas) for the new-generation index. Forward searches only to the nodes holding the old generation. After the bulk load is complete, raise the replica level on the new generation and switch from old to new with the help of the index alias (or by just dropping the old generation). You may see a performance hit while the replicas are building up, but it is small compared to that of the bulk load.

4. One of the simplest and most effective strategies is to start with an index that has no replicas. Once indexing is done, increase the number of replicas to the number you want. This reduces the load while indexing.
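
As a rough illustration of points 1, 3 and 4, the JSON bodies for the settings and alias calls around a bulk load could be generated like this. BulkLoadSettings is a hypothetical helper, and the index and alias names in main are made up. It only builds strings and assumes the names need no JSON escaping. Sending them is left to curl or any HTTP client.

```java
/** JSON bodies for the index settings and alias calls used around a bulk load. */
final class BulkLoadSettings {

    /** Body for PUT /<index>/_settings before loading: refresh disabled, no replicas. */
    static String beforeLoad() {
        return settings("-1", 0);
    }

    /** Body for PUT /<index>/_settings after loading: restore refresh and replicas. */
    static String afterLoad(String refreshInterval, int replicas) {
        return settings(refreshInterval, replicas);
    }

    /** Body for POST /_aliases that moves an alias from the old index to the new one. */
    static String switchAlias(String alias, String oldIndex, String newIndex) {
        return "{\"actions\":["
            + "{\"remove\":{\"index\":\"" + oldIndex + "\",\"alias\":\"" + alias + "\"}},"
            + "{\"add\":{\"index\":\"" + newIndex + "\",\"alias\":\"" + alias + "\"}}]}";
    }

    private static String settings(String refreshInterval, int replicas) {
        return "{\"index\":{\"refresh_interval\":\"" + refreshInterval
            + "\",\"number_of_replicas\":" + replicas + "}}";
    }

    public static void main(String[] args) {
        System.out.println(beforeLoad());
        System.out.println(afterLoad("1s", 1));
        System.out.println(switchAlias("products", "products_v1", "products_v2"));
    }
}
```

Putting the remove and add actions in a single _aliases request means searches against the alias never see a moment where it points at neither index.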


