This endpoint has been updated to include Post edit metadata. Learn more about this metadata on the “Edit Posts” fundamentals page.

Decahose stream

Enterprise

This is an enterprise API available within our managed access levels only. To use this API, you must first set up an account with our enterprise sales team. Learn more

The Decahose delivers a 10% random sample of the realtime X Firehose through a streaming connection. This is accomplished via a realtime sampling algorithm which randomly selects the data, while still allowing for the expected low-latency delivery of data as it is sent through the firehose by X.

Below are some of the features available with Decahose:

  • Expanded and enhanced URLs - fully unwinds shortened URLs and provides additional metadata (page title and description)
  • Stream partitioning - 2 partitions, each containing 50% of volume of the Decahose stream
  • Enhanced reliability - geographic diversity of backend systems

Note: This data is delivered in bulk, and does not support additional filtering (e.g. for keywords).

Enterprise

Streaming likes

This is an enterprise API available within our managed access levels only. To use this API, you must first set up an account with our enterprise sales team. Learn more

Likes provide insight into who likes Posts and deliver accurate like counts. Gnip’s Firehose and Decahose can deliver public likes related to the Posts delivered via Gnip. This yields realtime public engagement and audience metrics associated with a Post.

Getting started with Likes

As you prepare to consume likes data, you should know that:

  • Likes are delivered via an independent, separate stream
  • Likes are historically referred to as “Favorites”. The enriched native format payload maintains this nomenclature
  • Streams include only public likes
    • Public means that the liking user, Post creator and Post are all public on the platform
  • Likes are very similar to Retweets and represent a public signal of engagement
  • Payload elements include:
    • Original Post object
    • Actor object that created the original Post
    • Actor object that performed the like action
  • Only original content can be liked
    • Retweets cannot be liked. A like of a Retweet is applied to the original Post
    • Quoted Tweets can be liked
  • Like activities include applicable Gnip Enrichments (where purchased/applied)
  • Supported Products / Features
    • Likes streams support Backfill (where purchased/applied)
    • There is no Replay support for likes streams
    • There is no Search or Historical support for likes
    • There are no immediate plans to add likes support to PowerTrack

Decahose

Native enriched format payload

{
   "id":"43560406e0ad9f68374445f5f30c33fc",
   "created_at":"Thu Dec 01 22:27:39 +0000 2016",
   "timestamp_ms":1480631259353,
   "favorited_status":{
      "created_at":"Thu Dec 01 22:27:16 +0000 2016",
      "id":804451830033948672,
      "id_str":"804451830033948672",
      "text":"@kafammheppduman",
      "source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e",
      "truncated":false,
      "in_reply_to_status_id":803694205163814912,
      "in_reply_to_status_id_str":"803694205163814912",
      "in_reply_to_user_id":2855759795,
      "in_reply_to_user_id_str":"2855759795",
      "in_reply_to_screen_name":"kafammheppduman",
      "user":{
         "id":2855759795,
         "id_str":"2855759795",
         "name":"delirdim kanka",
         "screen_name":"kafammheppduman",
         "location":"sanane",
         "url":"http:\/\/instagram.com\/kafammheppduman",
         "description":"Manit @GalatasaraySk \ud83d\udc9e",
         "translator_type":"none",
         "protected":false,
         "verified":false,
         "followers_count":3702,
         "friends_count":607,
         "listed_count":1,
         "favourites_count":113338,
         "statuses_count":389,
         "created_at":"Sat Nov 01 22:38:25 +0000 2014",
         "utc_offset":null,
         "time_zone":null,
         "geo_enabled":true,
         "lang":"tr",
         "contributors_enabled":false,
         "is_translator":false,
         "profile_background_color":"C0DEED",
         "profile_background_image_url":"",
         "profile_background_image_url_https":"",
         "profile_background_tile":false,
         "profile_link_color":"1DA1F2",
         "profile_sidebar_border_color":"C0DEED",
         "profile_sidebar_fill_color":"DDEEF6",
         "profile_text_color":"333333",
         "profile_use_background_image":true,
       "Profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/804421763945861121\/v3bp9pnq_normal.jpg",
         "Profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/804421763945861121\/v3bp9pnq_normal.jpg",
         "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/2855759795\/1480630085",
         "default_profile":true,
         "default_profile_image":false,
         "following":null,
         "follow_request_sent":null,
         "notifications":null
      },
      "geo":null,
      "coordinates":null,
      "place":null,
      "contributors":null,
      "is_quote_status":false,
      "retweet_count":0,
      "favorite_count":0,
      "entities":{
         "hashtags":[],
         "urls":[],
         "user_mentions":[
            {
               "screen_name":"kafammheppduman",
               "name":"delirdim kanka",
               "id":2855759795,
               "id_str":"2855759795",
               "indices":[
                  0,
                  16
               ]
            }
         ],
         "symbols":[]
      },
      "favorited":false,
      "retweeted":false,
      "filter_level":"low",
      "lang":"und"
   },
   "user":{
      "id":774146932365070336,
      "id_str":"774146932365070336",
      "name":"Uyuyan Adam",
      "screen_name":"saykoMenn",
      "location":"Tarsus, T\u00fcrkiye",
      "url":"http:\/\/connected2.me\/pmc1i",
      "description":null,
      "translator_type":"none",
      "protected":false,
      "verified":false,
      "followers_count":414,
      "friends_count":393,
      "listed_count":0,
      "favourites_count":9868,
      "statuses_count":370,
      "created_at":"Fri Sep 09 07:26:26 +0000 2016",
      "utc_offset":null,
      "time_zone":null,
      "geo_enabled":false,
      "lang":"tr",
      "contributors_enabled":false,
      "is_translator":false,
      "profile_background_color":"F5F8FA",
      "profile_background_image_url":"",
      "profile_background_image_url_https":"",
      "profile_background_tile":false,
      "profile_link_color":"1DA1F2",
      "profile_sidebar_border_color":"C0DEED",
      "profile_sidebar_fill_color":"DDEEF6",
      "profile_text_color":"333333",
      "profile_use_background_image":true,
      "Profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/802992813424201728\/VMzcTL3x_normal.jpg",
      "Profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/802992813424201728\/VMzcTL3x_normal.jpg",
      "profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/774146932365070336\/1480283382",
      "default_profile":true,
      "default_profile_image":false,
      "following":null,
      "follow_request_sent":null,
      "notifications":null
   }
}

Like Delete / “Unlike” payload

{
   "delete":{
      "favorite":{
         "tweet_id":696615514970279937,
         "tweet_id_str":"696615514970279937",
         "user_id":2510287578,
         "user_id_str":"2510287578"
      },
      "timestamp_ms":"1480437031205"
   }
}
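
To make the two payload shapes above concrete, here is a minimal Python sketch that classifies an incoming activity from the likes stream. It assumes only the “favorited_status” and “delete.favorite” formats shown above; any other shape is reported as unknown.

import json

# Sketch: distinguish like and unlike activities by their payload shape.
def classify_like_activity(raw):
    activity = json.loads(raw)
    if "favorited_status" in activity:
        post = activity["favorited_status"]   # the original Post object
        liker = activity["user"]              # actor who performed the like
        return ("like", liker["screen_name"], post["id_str"])
    if "favorite" in activity.get("delete", {}):
        unlike = activity["delete"]["favorite"]
        return ("unlike", unlike["user_id_str"], unlike["tweet_id_str"])
    return ("unknown", None, None)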

Guides

Recovery and Redundancy

Introduction 

Streaming high volumes of realtime Posts brings with it a set of best practices that promote both data reliability and full-fidelity data. When consuming realtime data, maximizing your connection time is a fundamental goal. When disconnects occur, it is important to detect them automatically and reconnect. After reconnecting, it is important to assess whether there are any periods of missed data to backfill. The component that manages these details and consumes realtime Posts is only one part of a system with network, datastore, server, and storage concerns. Given the complexity of these systems, another best practice is to maintain separate streaming environments, with at least one stream each for development/testing and production.

Decahose comes with a set of features that help with these efforts.

  1. To support multiple environments, we can deploy Additional Streams for your account. These streams are independent of each other and have different stream_labels to help differentiate them.
  2. To help maintain a connection, each Decahose stream supports Redundant Connections. The most common architecture is for a stream to have two connections, consumed on the client side by two independent consumers, ideally on different networks. With this design, there can be redundancy across the client-side networks, servers, and datastore pathways. Note that a full copy of the data is served on each connection, so the client side must be tolerant of, and manage, duplicate data.
  3. A ‘heartbeat’ is provided every 10 seconds; however, with the Decahose stream, the volume of data is high enough that even a short period (e.g., a few seconds) of no Posts can indicate a connection issue. Therefore, both ‘data silence’ and a missing heartbeat can be used to detect a disconnect (see the sketch below).
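
For illustration, here is a minimal Python sketch of that detection approach, assuming the third-party requests library and treating blank keep-alive lines as heartbeats. A client read timeout turns both a missing heartbeat and prolonged data silence into an exception you can react to; the API reference below recommends a read timeout beyond 30 seconds.

import requests  # assumed third-party HTTP client

# Sketch: a read timeout converts heartbeat loss / data silence into an
# exception the caller can handle by reconnecting.
def consume(url, auth, on_activity, read_timeout=31):
    resp = requests.get(url, auth=auth, stream=True,
                        timeout=(10, read_timeout))  # (connect, read) seconds
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:                  # blank lines are keep-alive heartbeats
            on_activity(line)
    # Reaching here means the server closed the stream; a read timeout
    # raises instead. Either way, reconnect (with backfill if enabled).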

Since disconnects will happen, the Decahose stream has a dedicated Recovery and a Backfill feature to help recover data that was missed due to disconnections and other operational issues.

Additional Streams

Having additional Decahose streams is another way to build reliability into your solution. Any additional streams are completely independent, each with its own unique endpoint. Each stream is assigned its own stream_label, and this label, along with your account name, is part of that stream’s URL. See the example below:

https://gnip-stream.twitter.com/stream/sample10/accounts/:account_name/publishers/twitter/:stream_label.json

The most common convention is to have a realtime stream dedicated to your production system and an additional stream available for development and testing. Having a test/development stream gives Decahose customers a stream on which to test client consumer updates. While any (unique) label can be assigned to a stream, one convention is to use ‘prod’ for the production stream and ‘dev’ or ‘sandbox’ for an additional development stream.

The number of streams, and their unique labels, is configurable by your account representative.

Redundant Connections

A redundant connection simply allows you to establish more than one simultaneous connection to the data stream. This provides redundancy by allowing you to connect to the same stream with two separate consumers, receiving the same data through both connections. Thus, your app has a hot failover for various situations, e.g. where one stream is disconnected or where your app’s primary server fails.

The number of connections allowed for any given stream is configurable by your account representative. To use a redundant stream, simply connect to the same URL used for your primary connection. The data for your stream will be sent through both connections, with both stream connections represented on the stream dashboard.

Note that for billing purposes, we deduplicate the activity counts you receive through multiple connections so that you are only billed for each unique activity once. Given that the Decahose has two partitions, here’s an example of how the connection count works:

  • Connect to Decahose partition=1
  • Connect to Decahose partition=1
  • Connect to Decahose partition=2

The above situation yields a total of three connections: two connections to partition=1 and one connection to partition=2. Normally you would want the same number of connections to each partition, so this example highlights a situation where the redundant connection to partition=2 has dropped and you want to investigate further.
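
Because each connection carries a full copy of the data, deduplication on your side can be as simple as keeping a bounded set of recently seen activity ids, since duplicates from a redundant connection arrive close together in time. A minimal Python sketch, assuming each activity exposes a unique id:

from collections import OrderedDict

# Sketch: bounded memory of recently seen activity ids. is_new() returns
# True the first time an id is seen and False for duplicates.
class Deduplicator:
    def __init__(self, max_ids=100_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, activity_id):
        if activity_id in self._seen:
            return False
        self._seen[activity_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest id
        return True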

Recovery

Overview 

Recovery is a data recovery tool (not to be used for primary data collection) that provides streaming access to a rolling 5-day window of recent X historical data. It should be utilized to recover data in scenarios where your consuming application misses data in the realtime stream, whether due to disconnecting for a short period, or for any other scenario where you fail to ingest realtime data for a period of time.

Using Recovery 

Your app makes requests to the Recovery stream that operate in the same manner as requests to the realtime streams. However, your app must specify parameters in the URL that indicate the time window you are requesting. In other words, a Recovery request asks the API for “Posts from time A to time B.” These Posts are then delivered through your streaming connection in a manner that mimics the realtime stream, but at a slightly slower-than-realtime rate. See the example below:

https://stream-data-api.x.com/stream/powertrack/accounts/someAccountName/publishers/twitter/powertrack.json?startTime=2023-07-05T17:09:12.070Z

Posts are delivered beginning with the first (oldest) minute of the specified time period, continuing chronologically until the final minute is delivered. At that point, a Recovery Request Completed message is sent through the connection, and the connection is then closed by the server. If your request begins at a time of day when little or no matching data occurred, there will likely be some period of time before the first results are delivered; data will be delivered when Recovery encounters matches in the portion of the archive being processed at that time. When no results are available to deliver, the stream will continue sending carriage returns, or “heartbeats”, through the connection to prevent your connection from timing out.

Recovery is intended as a tool for easily recovering data missed due to short disconnects, not for very long time periods like an entire day. If the need to recover data for long periods arises, we recommend breaking longer requests into shorter time windows (e.g. two hours) to reduce the possibility of being disconnected mid-request due to internet volatility or other reasons, and to provide more visibility into the progress of long requests.
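
As an illustration of that windowing advice, here is a minimal Python sketch that splits a longer Recovery job into two-hour requests. The startTime parameter name follows the example URL above; the matching endTime name is an assumption, so confirm the exact parameter names for your stream with your account representative.

from datetime import datetime, timedelta

# Sketch: split a Recovery job into two-hour windows, as recommended above.
# "startTime" matches the example URL; "endTime" is an assumed counterpart.
def recovery_windows(start, end, hours=2):
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(hours=hours), end)
        yield {"startTime": cursor.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
               "endTime": window_end.strftime("%Y-%m-%dT%H:%M:%S.000Z")}
        cursor = window_end

# Usage: request each window in turn against the Recovery URL and consume
# the stream until the server sends Recovery Request Completed and closes.
print(list(recovery_windows(datetime(2023, 7, 5, 17, 0),
                            datetime(2023, 7, 5, 21, 0))))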

Data Availability

You can use the Recovery feature to recover missed data from the past 24 hours if you are unable to reconnect within the 5-minute backfill window.

The streaming Recovery feature gives you an extended backfill window of 24 hours. Recovery enables you to ‘recover’ the period of missed data. A Recovery stream is started when you make a connection request using the ‘start_time’ and ‘end_time’ request parameters. Once connected, Recovery will re-stream the indicated time period, then disconnect.

You can make 2 concurrent requests to Recovery, i.e. “two recovery jobs.” Recovery works technically in the same way as Backfill, except that a start and end time are defined. A Recovery period covers a single time range.

Backfill

To request backfill, add a backfillMinutes=N parameter to your connection request, where N is the number of minutes (1-5, whole numbers only) to backfill when the connection is made. For example, if you disconnect for 90 seconds, add backfillMinutes=2 to your connection request. This will provide backfill for 2 minutes, including the 30-second period before you disconnected, so your consumer app must be tolerant of duplicate data.

An example Decahose connection request URL, requesting a 5-minute backfill on partition 1, looks like:

https://gnip-stream.twitter.com/stream/sample10/accounts/:account_name/publishers/twitter/:stream_label.json?partition=1&backfillMinutes=5

NOTES:

  • You have the option to always use ‘backfillMinutes=5’ when you connect, then handle any duplicate data that is provided.

  • If you are disconnected for more than five minutes, you can recover data using Recovery.

Recovering from Disconnect

Restarting and recovering from a disconnect involves several steps:

  • Determine the length of the disconnect period:
    • 5 minutes or less?
      • If you have Backfill enabled for your stream, prepare the connection request with the appropriate ‘backfillMinutes’ parameter.
    • More than 5 minutes?
      • If you have a Recovery stream, make a Recovery request for the disconnected time period (ideally with your current realtime rule set, using the Rules API if necessary).
  • Request a new connection.
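
A minimal Python sketch of this decision logic (the 5-minute threshold and the 1-5 whole-minute backfillMinutes range come from the sections above; the function name is illustrative):

import math
import time

# Sketch: pick backfill for short gaps, Recovery for longer ones.
def reconnect_plan(disconnected_at, now=None):
    now = time.time() if now is None else now
    gap_minutes = math.ceil((now - disconnected_at) / 60)
    if gap_minutes <= 5:
        # append to the connection URL, e.g. ...&backfillMinutes=N
        return {"backfillMinutes": max(gap_minutes, 1)}
    # otherwise request the missed period via the Recovery stream
    return {"recovery": {"gap_minutes": gap_minutes}}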

When you experience disconnects or downtime, here are strategies to mitigate and recover:

  1. Implement Backfill. Backfill lets you reconnect from a point prior to disconnecting from a stream connection, and covers disconnects of up to 5 minutes. It is implemented by including a parameter in the connection request.

  2. Consume a redundant stream from another location. If the redundant stream can be consumed into the same live environment, with the data deduplicated, you will eliminate the need for recovery unless BOTH the normal stream and the redundant stream experience simultaneous downtime or disconnects. If the redundant stream cannot be consumed live into the production environment, it can be written to a separate “emergency” data store. Then, in the event of disconnects or downtime on the primary stream connection, your system will have data on hand to fill in your primary database for the period where data is missing.

  3. Implement Recovery. Where disconnects or downtime affect both the primary and redundant streams, use Decahose Recovery to recover any missed data. The API provides a rolling window covering 5 days of the archive, and is best utilized by requesting no more than an hour of that window at a time and streaming it in. This is done in parallel with the realtime stream. Note that we do not have solutions for recovering Decahose data from beyond the 5-day window provided by Recovery, so it is important to utilize a redundant stream to ensure you have a complete copy of the data on your side in the case of significant downtime.

When no disconnects or downtime have occurred but stored data volumes look abnormal, here are potential ways to detect missing data:

  1. Count inbound Posts. Your system should count the raw number of Posts you receive at the very beginning of your ingestion app, and then provide a way to compare those numbers to the number of Posts that reach your final data store. Any differences can be monitored and can alert your team to issues causing data to be dropped after it is received.

  2. Analyze for abnormal stored volumes. You may also want to analyze the volumes of stored data in your final database to look for abnormal drops. This can indicate issues as well, although there will likely be circumstances in which drops in volume are normal (e.g. if the X platform is unavailable and people are not able to create Posts for some period of time).
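
As a sketch of the first approach, in Python (the function and parameter names are illustrative):

# Sketch: compare ingestion-time counts against the final data store and
# alert when the shortfall exceeds a small tolerance.
def check_counts(received_count, stored_count, tolerance=0.001):
    dropped = received_count - stored_count
    if received_count and dropped / received_count > tolerance:
        raise RuntimeError(
            f"{dropped} of {received_count} Posts missing downstream")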

API Reference

Decahose stream

Methods

Method                      Description
GET /{stream-type}/:stream  Connect to the data stream

Authentication

All requests to the Volume Stream APIs must use HTTP Basic Authentication, constructed from a valid email address and password combination used to log in to your account at console.gnip.com. Credentials must be passed in the Authorization header with each request. Confirm that your client is adding the “Authorization: Basic” HTTP header (with encoded credentials, over HTTPS) to all API requests.
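
For illustration, here is a minimal Python sketch of constructing this header by hand; most HTTP clients (including curl’s -u flag) build it for you.

import base64

# Sketch: build the Basic Authorization header from account credentials.
def basic_auth_header(email, password):
    token = base64.b64encode(f"{email}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}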

GET /{stream-type}/:stream

Establishes a persistent connection to the data stream, through which the realtime data will be delivered.

Note: Please see this article for additional details on consuming streaming data after the connection is established.

Request Specifications

Request Method: HTTP GET

Connection Type: Keep-Alive. This should be specified in the header of the request.

URL: Found on the stream’s API Help page of your dashboard, using the following structure:

Decahose:

https://gnip-stream.twitter.com/stream/sample10/accounts/:account_name/publishers/twitter/:stream_label.json?partition=1

Partition (required): partition={#} - Partitioning is now required in order to consume the full stream. You will need to connect to the stream with the partition parameter specified. Below is the number of partitions per stream:

  • Decahose: 2 partitions

Compression: Gzip. To connect to the stream using Gzip compression, simply send an Accept-Encoding header in the connection request. The header should look like the following:

Accept-Encoding: gzip

Character Encoding: UTF-8

Response Format: JSON. The header of your request should specify JSON format for the response.

Rate Limit: 10 requests per 60 seconds.

Backfill Parameter: If you have purchased a stream with Backfill enabled, you’ll need to add the “backfillMinutes” parameter to your GET request to enable it.

Read Timeout: Set a read timeout on your client, and ensure that it is set to a value beyond 30 seconds.

Support for Tweet edits: All Tweet objects will include Tweet edit metadata describing the Tweet’s edit history. See the “Edit Tweets” fundamentals page for more details.

Responses

The following responses may be returned by the API for these requests. Most error codes are returned with a string containing additional details in the body. For non-200 responses, clients should attempt to reconnect.

200 - Success: The connection was successfully opened, and new activities will be sent through as they arrive.

401 - Unauthorized: HTTP authentication failed due to invalid credentials. Log in to console.gnip.com with your credentials to ensure you are using them correctly with your request.

406 - Not Acceptable: Generally, this occurs when your client fails to properly include the headers to accept gzip encoding from the stream, but it can occur in other circumstances as well. The response will contain a JSON message similar to “This connection requires compression. To enable compression, send an ‘Accept-Encoding: gzip’ header in your request and be ready to uncompress the stream as it is read on the client end.”

429 - Rate Limited: Your app has exceeded the limit on connection requests.

503 - Service Unavailable: X server issue. Reconnect using an exponential backoff pattern (see the sketch below). If no notice about this issue has been posted on the X API Status Page, contact support (or use your emergency contact) if you are unable to connect after 10 minutes.
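
For non-200 responses such as 503, exponential backoff with jitter is the standard reconnection pattern; a minimal Python sketch:

import random

# Sketch: yield successively longer, jittered delays up to a ceiling.
def backoff_delays(initial=1.0, ceiling=64.0):
    delay = initial
    while True:
        yield delay + random.uniform(0, delay / 2)  # jitter spreads retries
        delay = min(delay * 2, ceiling)

# Usage: for delay in backoff_delays(): attempt to connect; on failure,
# time.sleep(delay) and retry; on success, break out of the loop.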

Example curl Request

The following example request is accomplished using cURL on the command line. However, note that these requests can also be sent using the programming language of your choice:

curl --compressed -v -uexample@customer.com "https://gnip-stream.twitter.com/stream/sample10/accounts/:account_name/publishers/twitter/:stream_label.json?partition={#}"

Replay API

The Replay API is an important complement to realtime Volume streams. Replay is a data recovery tool that provides streaming access to a rolling window of recent X historical data.