Collecting Unique Tweets with JavaScript

October 9, 2024

human (Oct 8, 2024, 07:53 PM)

I wrote this:

tweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    let datetime = el.querySelector('time').dateTime;
    let username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
    let tweet = el.querySelector('[data-testid="tweetText"]').innerText;
    let href = el.querySelector('time').closest('a').href;
});

Write me a block of JavaScript which runs this inside a setInterval every 500ms - it collects the tweets visible on the page into an array, but avoids collecting any with an href that has already been seen

The tweets should be in window.tweets so I can copy them out later

assistant (Oct 8, 2024, 07:53 PM)

Here's a JavaScript block that incorporates your code and runs it every 500ms, collecting unique tweets into window.tweets:

// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];

// Set to store previously seen hrefs
const seenHrefs = new Set();

// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const datetime = el.querySelector('time').dateTime;
    const username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
    const tweet = el.querySelector('[data-testid="tweetText"]').innerText;
    const href = el.querySelector('time').closest('a').href;

    return { datetime, username, tweet, href };
  });

  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}

// Run collectTweets every 500ms
setInterval(collectTweets, 500);

This script does the following:

Initializes window.tweets as an empty array if it doesn't already exist.
Creates a Set called seenHrefs to keep track of tweet URLs we've already processed.
Defines a collectTweets function that:
- Collects tweet data using your provided code.
- Filters out tweets with previously seen hrefs.
- Adds new tweets to window.tweets and their hrefs to seenHrefs.
Uses setInterval to run collectTweets every 500ms.

You can run this script in the browser's console on a Twitter page. The unique tweets will be collected in window.tweets, which you can access and copy out later as needed.
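The de-duplication step can also be pulled out into a small helper that is easy to check on its own. The following is a minimal sketch, not part of the script above: mergeUnique and the sample hrefs are hypothetical names and values chosen for illustration, and skipping entries with an empty href is an assumption about what you would want.

```javascript
// Hypothetical helper (not in the script above): append only unseen tweets.
// `store` is the array to grow, `seen` is a Set of hrefs already collected.
function mergeUnique(store, seen, batch) {
  let added = 0;
  for (const tweetObj of batch) {
    // Assumption: an entry without an href cannot be de-duplicated, so skip it.
    if (!tweetObj.href || seen.has(tweetObj.href)) continue;
    seen.add(tweetObj.href);
    store.push(tweetObj);
    added += 1;
  }
  return added; // number of tweets newly added in this pass
}

// Illustrative data only; these URLs are made up.
const demoStore = [];
const demoSeen = new Set();
const firstPass = mergeUnique(demoStore, demoSeen, [
  { href: 'https://example.com/user/status/1' },
  { href: 'https://example.com/user/status/2' },
]);
const secondPass = mergeUnique(demoStore, demoSeen, [
  { href: 'https://example.com/user/status/2' }, // already seen
  { href: 'https://example.com/user/status/3' },
]);
console.log(firstPass, secondPass, demoStore.length); // 2 1 3
```

Inside collectTweets, the forEach loop could then be replaced with mergeUnique(window.tweets, seenHrefs, newTweets).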

human (Oct 8, 2024, 08:00 PM)

A JavaScript function which takes an HTML element, clones it, removes all class attributes from all elements within it and returns the new innerHTML

assistant (Oct 8, 2024, 08:00 PM)

Here's a JavaScript function that does what you've described:

function cloneAndStripClasses(element) {
  // Clone the element
  const clone = element.cloneNode(true);
  
  // Select all elements within the clone that have a class attribute
  const elementsWithClass = clone.querySelectorAll('[class]');
  
  // Remove the class attribute from each element
  elementsWithClass.forEach(el => el.removeAttribute('class'));
  
  // Return the innerHTML of the modified clone
  return clone.innerHTML;
}

This function does the following:

It takes an HTML element as an argument.
It creates a deep clone of the element using cloneNode(true).
It uses querySelectorAll('[class]') to select all elements within the clone that have a class attribute.
It iterates over these elements and removes the class attribute from each one using removeAttribute('class').
Finally, it returns the innerHTML of the modified clone.

You can use this function like this:

// Assuming you have an element with id 'myElement'
const myElement = document.getElementById('myElement');
const strippedHTML = cloneAndStripClasses(myElement);

console.log(strippedHTML);

This will give you the HTML content of myElement with all class attributes removed, without modifying the original element in the DOM.
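If you later need to strip other attributes as well (for example style), the same idea generalizes. This is a sketch under that assumption; cloneAndStripAttributes and its attributeNames parameter are hypothetical names, and it assumes plain attribute names that need no escaping inside a selector.

```javascript
// Hypothetical generalization: remove every attribute named in attributeNames
// from the descendants of a deep clone, leaving the original element untouched.
function cloneAndStripAttributes(element, attributeNames = ['class']) {
  const clone = element.cloneNode(true);
  for (const name of attributeNames) {
    // [name] matches any descendant that carries the attribute at all
    clone.querySelectorAll(`[${name}]`).forEach((el) => el.removeAttribute(name));
  }
  return clone.innerHTML;
}

// Browser usage (assumes an element with id 'myElement' exists):
// const html = cloneAndStripAttributes(document.getElementById('myElement'), ['class', 'style']);
```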

human (Oct 8, 2024, 08:01 PM)

JavaScript one-liner to remove any span elements that only contain …

assistant (Oct 8, 2024, 08:01 PM)

Here's a JavaScript one-liner to remove any span elements that only contain "…":

document.querySelectorAll('span').forEach(span => span.textContent.trim() === '…' && span.remove());

This line does the following:

Selects all <span> elements in the document using querySelectorAll('span').
Iterates over each span using forEach().
Checks if the trimmed text content of the span is exactly "…" (the single ellipsis character, U+2026, not three periods) using span.textContent.trim() === '…'.
If the condition is true, removes the span from the DOM using span.remove().

You can run this in the browser's console to immediately remove such spans from the current page. If you want to apply this to a specific part of the document rather than the entire document, you can replace document with a reference to the parent element you want to target.
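Because an ellipsis can be written either as the single character U+2026 or as three periods, a slightly longer variant can accept both. This is a sketch; isEllipsisOnly and removeEllipsisSpans are hypothetical helper names, and treating three periods the same way as the single character is an assumption.

```javascript
// Hypothetical predicate: true when the text is nothing but an ellipsis,
// written either as the single character U+2026 or as three periods.
function isEllipsisOnly(text) {
  const trimmed = (text || '').trim();
  return trimmed === '\u2026' || trimmed === '...';
}

// Hypothetical wrapper: remove matching spans under `root` (pass document
// for the whole page) and report how many were removed.
function removeEllipsisSpans(root) {
  let removed = 0;
  root.querySelectorAll('span').forEach((span) => {
    if (isEllipsisOnly(span.textContent)) {
      span.remove();
      removed += 1;
    }
  });
  return removed;
}

// Browser usage: removeEllipsisSpans(document);
```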

human (Oct 8, 2024, 08:06 PM)

Use querySelector to find child element with aria-label that contains "View analytics"

assistant (Oct 8, 2024, 08:06 PM)

Here's a JavaScript one-liner using querySelector to find a child element with an aria-label that contains "View analytics":

const element = parentElement.querySelector('[aria-label*="View analytics"]');

This line does the following:

Uses querySelector on the parentElement (you should replace this with the actual parent element you're searching within).
The attribute selector [aria-label*="View analytics"] looks for an element where the aria-label attribute contains the text "View analytics".
The *= in the selector means "contains" - it will match if "View analytics" appears anywhere in the aria-label value.

If you want to find all such elements instead of just the first one, you can use querySelectorAll:

const elements = parentElement.querySelectorAll('[aria-label*="View analytics"]');

Remember to replace parentElement with the actual element you're searching within. If you want to search the entire document, you can use document instead:

const element = document.querySelector('[aria-label*="View analytics"]');

This selector is case-sensitive. For a case-insensitive match, add the i flag before the closing bracket, as in [aria-label*="view analytics" i]; current browsers support this flag in attribute selectors.
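When the match needs to be looser than the selector expresses, one option is to select every element that has an aria-label and filter in JavaScript. A minimal sketch follows; findByAriaLabel is a hypothetical helper name, and lower-casing both sides is just one way to compare.

```javascript
// Hypothetical helper: return descendants of `parent` whose aria-label
// contains `needle`, ignoring letter case.
function findByAriaLabel(parent, needle) {
  const wanted = needle.toLowerCase();
  return Array.from(parent.querySelectorAll('[aria-label]')).filter((el) =>
    (el.getAttribute('aria-label') || '').toLowerCase().includes(wanted)
  );
}

// Browser usage: findByAriaLabel(document, 'view analytics');
```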

human (Oct 8, 2024, 08:06 PM)

const impressions = parentElement.querySelector('[aria-label*="View post analytics"]');

Now use a regex to pull out the number from "11444 views. View post analytics" and convert that to integer

assistant (Oct 8, 2024, 08:06 PM)

Certainly. Here's a JavaScript one-liner that does what you're asking:

const impressionCount = parseInt(impressions.getAttribute('aria-label').match(/(\d+)/)[0], 10);

Let's break this down:

impressions.getAttribute('aria-label') gets the full text of the aria-label.
.match(/(\d+)/) uses a regular expression to find one or more digits in the string.
- \d+ matches one or more digits.
- The parentheses () create a capturing group.
[0] is the full text matched by the regex (index 1 would hold the capturing group, which is the same string here).
parseInt(..., 10) converts the matched string of digits to an integer. The 10 specifies base 10.

If you want to make this more robust to handle cases where there might not be a number or the format might be different, you could use this slightly longer version:

const impressionCount = (impressions.getAttribute('aria-label').match(/(\d+)(?=\s+views)/) || [null])[0];
const viewCount = impressionCount ? parseInt(impressionCount, 10) : null;

This version:

Uses (?=\s+views) to ensure we're matching the number that comes right before "views".
Returns null if no match is found, instead of throwing an error.
Stores the result in viewCount, which will be either the integer number of views or null if no valid number was found.

This approach ties the number to the word "views" and handles a label with no matching number without throwing. It still throws if impressions itself is null, so check the querySelector result first.
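As a self-contained illustration, the extraction can be wrapped in a function that takes the label text directly. This is a sketch only: parseViewCount is a hypothetical name, and tolerating thousands separators such as "11,444" is an assumption, since the only label shown here is "11444 views. View post analytics".

```javascript
// Hypothetical helper: pull the integer that precedes "views" out of a label.
// Returns null when the label is missing or has no such number.
function parseViewCount(label) {
  // Assumption: digits may be grouped with commas, e.g. "11,444 views"
  const match = (label || '').match(/(\d[\d,]*)\s+views/);
  if (!match) return null;
  return parseInt(match[1].replace(/,/g, ''), 10);
}

console.log(parseViewCount('11444 views. View post analytics')); // 11444
console.log(parseViewCount('View post analytics'));              // null
```

In the browser this could be called as parseViewCount(impressions?.getAttribute('aria-label')), which also covers the case where the element was not found.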

human (Oct 8, 2024, 08:10 PM)

function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const datetime = el.querySelector('time').dateTime;
    const username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
    const tweet = el.querySelector('[data-testid="tweetText"]').innerText;
    const href = el.querySelector('time').closest('a').href;
    const likes = parseInt(el.querySelector('[data-testid="like"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    const impressions = parseInt(el.querySelector('[aria-label*="View post analytics"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    const retweets = parseInt(el.querySelector('[aria-label*="Repost"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    return { datetime, username, tweet, href, likes, impressions, retweets };
  });

  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}

I'm getting this:

TypeError: el.querySelector(...) is null

Rewrite this code so that it's robust against those errors, it should record the various counts as 0 if that happens

assistant (Oct 8, 2024, 08:10 PM)

Certainly. Here's a more robust version of your collectTweets function that handles potential null values and sets counts to 0 if the elements aren't found:

function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const safeQuerySelector = (selector) => el.querySelector(selector) || {};
    // getAttribute returns null for a missing attribute, so fall back to ''
    const safeGetAttribute = (element, attr) => element.getAttribute ? (element.getAttribute(attr) || '') : '';
    const safeExtractNumber = (str) => {
      const match = str.match(/(\d+)/);
      return match ? parseInt(match[0], 10) : 0;
    };

    const datetime = safeQuerySelector('time').dateTime || '';
    const username = (safeQuerySelector('[data-testid="User-Name"] a').href || '').split('/').slice(-1)[0] || '';
    const tweet = safeQuerySelector('[data-testid="tweetText"]').innerText || '';
    // closest('a') returns null when the timestamp is not wrapped in a link
    const timeEl = safeQuerySelector('time');
    const href = timeEl.closest ? ((timeEl.closest('a') || {}).href || '') : '';
    
    const likes = safeExtractNumber(safeGetAttribute(safeQuerySelector('[data-testid="like"]'), 'aria-label'));
    const impressions = safeExtractNumber(safeGetAttribute(safeQuerySelector('[aria-label*="View post analytics"]'), 'aria-label'));
    const retweets = safeExtractNumber(safeGetAttribute(safeQuerySelector('[aria-label*="Repost"]'), 'aria-label'));

    return { datetime, username, tweet, href, likes, impressions, retweets };
  });

  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}

This version includes several improvements:

safeQuerySelector: A helper function that returns an empty object if querySelector returns null, preventing "cannot read property of null" errors.
safeGetAttribute: A helper function that safely gets an attribute, returning an empty string if the element or the getAttribute method doesn't exist.
safeExtractNumber: A helper function that safely extracts a number from a string, returning 0 if no number is found.
Each property is now extracted using these safe methods, with fallbacks to empty strings or 0 as appropriate.
The closest method is checked for existence before being used.

This version of the function should be much more resilient to missing elements or attributes. If any part of the tweet's structure is not found, it will default to empty values or 0 for numeric fields, rather than throwing an error.

Remember to ensure that seenHrefs and window.tweets are properly initialized before running this function:

const seenHrefs = new Set();
window.tweets = window.tweets || [];

This approach allows the function to collect as much data as possible from each tweet, even if some parts of the tweet's structure are missing or have changed.

human (Oct 8, 2024, 08:15 PM)

// Ditch any <span>…</span> elements
document
  .querySelectorAll("span")
  .forEach((span) => span.textContent.trim() === "…" && span.remove());

// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];

// Set to store previously seen hrefs
let seenHrefs = new Set();

// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(
    document.querySelectorAll("[data-testid=tweet]"),
    (el) => {
      const safeQuerySelector = (selector) => el.querySelector(selector) || {};
      const safeGetAttribute = (element, attr) =>
        element.getAttribute ? element.getAttribute(attr) : "";
      const safeExtractNumber = (str) => {
        const match = str.match(/(\d+)/);
        return match ? parseInt(match[0], 10) : 0;
      };

      const datetime = safeQuerySelector("time").dateTime || "";
      const username =
        (safeQuerySelector('[data-testid="User-Name"] a').href || "")
          .split("/")
          .slice(-1)[0] || "";
      const tweet =
        safeQuerySelector('[data-testid="tweetText"]').innerText || "";
      const href = safeQuerySelector("time").closest
        ? safeQuerySelector("time").closest("a").href
        : "";

      const likes = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[data-testid="like"]'),
          "aria-label",
        ),
      );
      const impressions = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[aria-label*="View post analytics"]'),
          "aria-label",
        ),
      );
      const retweets = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[aria-label*="Repost"]'),
          "aria-label",
        ),
      );
      return { datetime, username, tweet, href, likes, impressions, retweets };
    },
  );

  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach((tweetObj) => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}

// Run collectTweets every 500ms
setInterval(collectTweets, 500);

// Run copy(window.tweets) later to copy collected tweets to the clipboard

Instead of all that safeX stuff, roll back to my earlier code - write an extractNumber(select) function which attempts to pull the integers out of the aria label of the thing matched by the provided selector and use that for the likes, impressions and retweets

assistant (Oct 8, 2024, 08:15 PM)

Certainly! Here's a revised version of your code that uses an extractNumber function to handle the likes, impressions, and retweets:

// Ditch any <span>…</span> elements
document
  .querySelectorAll("span")
  .forEach((span) => span.textContent.trim() === "…" && span.remove());

// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];

// Set to store previously seen hrefs
let seenHrefs = new Set();

// Function to extract number from aria-label
function extractNumber(el, selector) {
  const element = el.querySelector(selector);
  // getAttribute returns null when the attribute is absent, so fall back to ''
  const label = element?.getAttribute('aria-label') || '';
  const match = label.match(/(\d+)/);
  return match ? parseInt(match[0], 10) : 0;
}

// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(
    document.querySelectorAll("[data-testid=tweet]"),
    (el) => {
      const datetime = el.querySelector("time")?.dateTime || "";
      const username = el.querySelector('[data-testid="User-Name"] a')?.href.split("/").slice(-1)[0] || "";
      const tweet = el.querySelector('[data-testid="tweetText"]')?.innerText || "";
      const href = el.querySelector("time")?.closest("a")?.href || "";

      const likes = extractNumber(el, '[data-testid="like"]');
      const impressions = extractNumber(el, '[aria-label*="View post analytics"]');
      const retweets = extractNumber(el, '[aria-label*="Repost"]');

      return { datetime, username, tweet, href, likes, impressions, retweets };
    }
  );

  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach((tweetObj) => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}

// Run collectTweets every 500ms
setInterval(collectTweets, 500);

// Run copy(window.tweets) later to copy collected tweets to the clipboard

This version makes the following changes:

It introduces an extractNumber function that takes the parent element and a selector. This function tries to find the element, get its aria-label attribute, extract the number from it, and return the parsed integer or 0 if anything fails.
The main collectTweets function now uses optional chaining (?.) for accessing properties that might not exist, providing fallback empty strings where appropriate.
It uses the new extractNumber function for likes, impressions, and retweets, which simplifies the code and makes it more robust against missing elements or attributes.
The code still retains the ability to handle cases where elements or attributes are missing, defaulting to empty strings for text fields and 0 for numeric fields.

This approach should give you a good balance between simplicity and robustness. It will attempt to extract the data you need, and if any part fails (due to missing elements or changed structure), it will default to safe values without throwing errors.
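One thing the script does not do is stop. A small wrapper, sketched below, keeps the timer id so collection can be halted once you have scrolled far enough. startCollector is a hypothetical name, and running the collector once immediately is a design choice of this sketch rather than something the code above does.

```javascript
// Hypothetical wrapper: run `collect` right away, then every `intervalMs`
// milliseconds, and return a function that stops the timer.
function startCollector(collect, intervalMs = 500) {
  collect();
  const timerId = setInterval(collect, intervalMs);
  return function stop() {
    clearInterval(timerId);
  };
}

// Browser usage (assumes collectTweets is defined as above):
// const stopCollecting = startCollector(collectTweets, 500);
// ...scroll the page, then:
// stopCollecting();
// copy(window.tweets);
```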