Collecting Unique Tweets with JavaScript
October 9, 2024
human (Oct 8, 2024, 07:53 PM)
I wrote this:
tweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
  let datetime = el.querySelector('time').dateTime;
  let username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
  let tweet = el.querySelector('[data-testid="tweetText"]').innerText;
  let href = el.querySelector('time').closest('a').href;
});
Write me a block of JavaScript which runs this inside a setInterval every 500ms - it collects the tweets visible on the page into an array, but avoids collecting any with an href that has already been seen
The tweets should be in window.tweets so I can copy them out later
assistant (Oct 8, 2024, 07:53 PM)
Here's a JavaScript block that incorporates your code and runs it every 500ms, collecting unique tweets into window.tweets:
// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];
// Set to store previously seen hrefs
const seenHrefs = new Set();
// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const datetime = el.querySelector('time').dateTime;
    const username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
    const tweet = el.querySelector('[data-testid="tweetText"]').innerText;
    const href = el.querySelector('time').closest('a').href;
    return { datetime, username, tweet, href };
  });
  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}
// Run collectTweets every 500ms
setInterval(collectTweets, 500);
This script does the following:
- Initializes window.tweets as an empty array if it doesn't already exist.
- Creates a Set called seenHrefs to keep track of tweet URLs we've already processed.
- Defines a collectTweets function that:
  - Collects tweet data using your provided code.
  - Filters out tweets with previously seen hrefs.
  - Adds new tweets to window.tweets and their hrefs to seenHrefs.
- Uses setInterval to run collectTweets every 500ms.
You can run this script in the browser's console on a Twitter page. The unique tweets will be collected in window.tweets, which you can access and copy out later as needed.
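The de-duplication step can also be tried outside the browser. Below is a minimal sketch, assuming a hypothetical helper named mergeUnseen and made-up sample objects in place of real tweets. It only illustrates how the Set prevents repeats across polling passes.

```javascript
// Hypothetical helper (not part of the script above): appends only the items
// whose href is not yet recorded in seenHrefs, and returns how many were added.
function mergeUnseen(collected, seenHrefs, batch) {
  let added = 0;
  for (const item of batch) {
    if (!seenHrefs.has(item.href)) {
      seenHrefs.add(item.href);
      collected.push(item);
      added += 1;
    }
  }
  return added;
}

// Made-up sample data standing in for two consecutive polling passes.
const collected = [];
const seenHrefs = new Set();
const firstPass = [
  { href: 'https://example.com/a/status/1', tweet: 'first' },
  { href: 'https://example.com/b/status/2', tweet: 'second' },
];
const secondPass = [
  { href: 'https://example.com/b/status/2', tweet: 'second' }, // already seen
  { href: 'https://example.com/c/status/3', tweet: 'third' },
];

const addedFirst = mergeUnseen(collected, seenHrefs, firstPass);
const addedSecond = mergeUnseen(collected, seenHrefs, secondPass);
console.log(addedFirst, addedSecond, collected.length); // 2 1 3
```

The second pass adds only one item because the repeated href is already in the Set.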
human (Oct 8, 2024, 08:00 PM)
a JavaScript function which takes an HTML element, clones it, removes all class attributes from all elements within it and returns the new innerHTML
assistant (Oct 8, 2024, 08:00 PM)
Here's a JavaScript function that does what you've described:
function cloneAndStripClasses(element) {
  // Clone the element
  const clone = element.cloneNode(true);
  // Select all elements within the clone that have a class attribute
  const elementsWithClass = clone.querySelectorAll('[class]');
  // Remove the class attribute from each element
  elementsWithClass.forEach(el => el.removeAttribute('class'));
  // Return the innerHTML of the modified clone
  return clone.innerHTML;
}
This function does the following:
- It takes an HTML element as an argument.
- It creates a deep clone of the element using cloneNode(true).
- It uses querySelectorAll('[class]') to select all elements within the clone that have a class attribute.
- It iterates over these elements and removes the class attribute from each one using removeAttribute('class').
- Finally, it returns the innerHTML of the modified clone.
You can use this function like this:
// Assuming you have an element with id 'myElement'
const myElement = document.getElementById('myElement');
const strippedHTML = cloneAndStripClasses(myElement);
console.log(strippedHTML);
This will give you the HTML content of myElement with all class attributes removed, without modifying the original element in the DOM.
human (Oct 8, 2024, 08:01 PM)
JavaScript one-liner to remove any span elements that only contain …
assistant (Oct 8, 2024, 08:01 PM)
Here's a JavaScript one-liner to remove any span elements that only contain "…":
document.querySelectorAll('span').forEach(span => span.textContent.trim() === '…' && span.remove());
This line does the following:
- Selects all <span> elements in the document using querySelectorAll('span').
- Iterates over each span using forEach().
- Checks if the trimmed text content of the span is exactly "…" using span.textContent.trim() === '…'.
- If the condition is true, removes the span from the DOM using span.remove().
You can run this in the browser's console to immediately remove such spans from the current page. If you want to apply this to a specific part of the document rather than the entire document, you can replace document with a reference to the parent element you want to target.
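One thing to watch: the single ellipsis character (U+2026) is not the same string as three periods, so the comparison has to match whatever the page actually uses. Here is a small sketch, assuming a hypothetical predicate called isEllipsisOnly, that accepts either form:

```javascript
// Hypothetical predicate: true when the text is nothing but an ellipsis, written
// either as the single character U+2026 or as three ASCII periods.
function isEllipsisOnly(text) {
  const trimmed = (text || '').trim();
  return trimmed === '\u2026' || trimmed === '...';
}

console.log(isEllipsisOnly(' \u2026 '));   // true
console.log(isEllipsisOnly('...'));        // true
console.log(isEllipsisOnly('more\u2026')); // false

// In the browser console it could be used like this:
// document.querySelectorAll('span').forEach(span => isEllipsisOnly(span.textContent) && span.remove());
```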
human (Oct 8, 2024, 08:06 PM)
use querySelector to find child element with aria-label that contains "View analytics"
assistant (Oct 8, 2024, 08:06 PM)
Here's a JavaScript one-liner using querySelector to find a child element with an aria-label that contains "View analytics":
const element = parentElement.querySelector('[aria-label*="View analytics"]');
This line does the following:
- Uses querySelector on the parentElement (you should replace this with the actual parent element you're searching within).
- The attribute selector [aria-label*="View analytics"] looks for an element where the aria-label attribute contains the text "View analytics".
- The *= in the selector means "contains" - it will match if "View analytics" appears anywhere in the aria-label value.
If you want to find all such elements instead of just the first one, you can use querySelectorAll:
const elements = parentElement.querySelectorAll('[aria-label*="View analytics"]');
Remember to replace parentElement with the actual element you're searching within. If you want to search the entire document, you can use document instead:
const element = document.querySelector('[aria-label*="View analytics"]');
This selector is case-sensitive. If you need a case-insensitive match, add the i flag before the closing bracket, as in [aria-label*="view analytics" i].
human (Oct 8, 2024, 08:06 PM)
const impressions = parentElement.querySelector('[aria-label*="View post analytics"]');
Now use a regex to pull out the number from "11444 views. View post analytics" and convert that to integer
assistant (Oct 8, 2024, 08:06 PM)
Certainly. Here's a JavaScript one-liner that does what you're asking:
const impressionCount = parseInt(impressions.getAttribute('aria-label').match(/(\d+)/)[0], 10);
Let's break this down:
- impressions.getAttribute('aria-label') gets the full text of the aria-label.
- .match(/(\d+)/) uses a regular expression to find the first run of one or more digits in the string.
  - \d+ matches one or more digits.
  - The parentheses () create a capturing group.
- [0] gets the full matched text (index 1 would hold the capturing group, which is identical here).
- parseInt(..., 10) converts the matched string of digits to an integer. The 10 specifies base 10.
If you want to make this more robust to handle cases where there might not be a number or the format might be different, you could use this slightly longer version:
const impressionCount = (impressions.getAttribute('aria-label').match(/(\d+)(?=\s+views)/) || [null])[0];
const viewCount = impressionCount ? parseInt(impressionCount, 10) : null;
This version:
- Uses (?=\s+views) to ensure we're matching the number that comes right before "views".
- Returns null if no match is found, instead of throwing an error.
- Stores the result in viewCount, which will be either the integer number of views or null if no valid number was found.
This approach is more resilient to changes in the text format and will gracefully handle cases where the expected number isn't present.
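To try the matching logic on plain strings, here is a minimal sketch. The function name parseViewCount is hypothetical, and the handling of thousands separators such as "11,444" is an assumption about how the label might be formatted, not something confirmed from the page.

```javascript
// Hypothetical helper: pulls the number that precedes "views" out of a label string.
// Returns null when the label is missing or contains no such number.
function parseViewCount(label) {
  // \d[\d,]* allows optional comma separators (an assumed format).
  const match = (label || '').match(/(\d[\d,]*)(?=\s+views)/);
  if (!match) return null;
  return parseInt(match[1].replace(/,/g, ''), 10);
}

console.log(parseViewCount('11444 views. View post analytics'));  // 11444
console.log(parseViewCount('11,444 views. View post analytics')); // 11444
console.log(parseViewCount('View post analytics'));               // null
```

In the browser it could be called as parseViewCount(impressions?.getAttribute('aria-label')), which also covers the case where the element itself was not found.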
human (Oct 8, 2024, 08:10 PM)
function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const datetime = el.querySelector('time').dateTime;
    const username = el.querySelector('[data-testid="User-Name"]').querySelector('a').href.split('/').slice(-1)[0];
    const tweet = el.querySelector('[data-testid="tweetText"]').innerText;
    const href = el.querySelector('time').closest('a').href;
    const likes = parseInt(el.querySelector('[data-testid="like"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    const impressions = parseInt(el.querySelector('[aria-label*="View post analytics"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    const retweets = parseInt(el.querySelector('[aria-label*="Repost"]').getAttribute('aria-label').match(/(\d+)/)[0], 10);
    return { datetime, username, tweet, href, likes, impressions, retweets };
  });
  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}
I'm getting this:
TypeError: el.querySelector(...) is null
Rewrite this code so that it's robust against those errors, it should record the various counts as 0 if that happens
assistant (Oct 8, 2024, 08:10 PM)
Certainly. Here's a more robust version of your collectTweets function that handles potential null values and sets counts to 0 if the elements aren't found:
function collectTweets() {
  const newTweets = Array.from(document.querySelectorAll('[data-testid=tweet]'), el => {
    const safeQuerySelector = (selector) => el.querySelector(selector) || {};
    // getAttribute returns null for a missing attribute, so fall back to '' in that case too
    const safeGetAttribute = (element, attr) => element.getAttribute ? element.getAttribute(attr) || '' : '';
    const safeExtractNumber = (str) => {
      const match = str.match(/(\d+)/);
      return match ? parseInt(match[0], 10) : 0;
    };
    const datetime = safeQuerySelector('time').dateTime || '';
    const username = (safeQuerySelector('[data-testid="User-Name"] a').href || '').split('/').slice(-1)[0] || '';
    const tweet = safeQuerySelector('[data-testid="tweetText"]').innerText || '';
    // closest('a') can return null, so guard it before reading href
    const timeLink = safeQuerySelector('time').closest ? safeQuerySelector('time').closest('a') : null;
    const href = timeLink ? timeLink.href : '';
    const likes = safeExtractNumber(safeGetAttribute(safeQuerySelector('[data-testid="like"]'), 'aria-label'));
    const impressions = safeExtractNumber(safeGetAttribute(safeQuerySelector('[aria-label*="View post analytics"]'), 'aria-label'));
    const retweets = safeExtractNumber(safeGetAttribute(safeQuerySelector('[aria-label*="Repost"]'), 'aria-label'));
    return { datetime, username, tweet, href, likes, impressions, retweets };
  });
  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach(tweetObj => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}
This version includes several improvements:
- safeQuerySelector: A helper function that returns an empty object if querySelector returns null, preventing "cannot read property of null" errors.
- safeGetAttribute: A helper function that safely gets an attribute, returning an empty string if the element or the getAttribute method doesn't exist.
- safeExtractNumber: A helper function that safely extracts a number from a string, returning 0 if no number is found.
- Each property is now extracted using these safe methods, with fallbacks to empty strings or 0 as appropriate.
- The closest method is checked for existence before being used.
This version of the function should be much more resilient to missing elements or attributes. If any part of the tweet's structure is not found, it will default to empty values or 0 for numeric fields, rather than throwing an error.
Remember to ensure that seenHrefs and window.tweets are properly initialized before running this function:
const seenHrefs = new Set();
window.tweets = window.tweets || [];
This approach allows the function to collect as much data as possible from each tweet, even if some parts of the tweet's structure are missing or have changed.
human (Oct 8, 2024, 08:15 PM)
// Ditch any <span>…</span> elements
document
  .querySelectorAll("span")
  .forEach((span) => span.textContent.trim() === "…" && span.remove());
// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];
// Set to store previously seen hrefs
let seenHrefs = new Set();
// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(
    document.querySelectorAll("[data-testid=tweet]"),
    (el) => {
      const safeQuerySelector = (selector) => el.querySelector(selector) || {};
      const safeGetAttribute = (element, attr) =>
        element.getAttribute ? element.getAttribute(attr) : "";
      const safeExtractNumber = (str) => {
        const match = str.match(/(\d+)/);
        return match ? parseInt(match[0], 10) : 0;
      };
      const datetime = safeQuerySelector("time").dateTime || "";
      const username =
        (safeQuerySelector('[data-testid="User-Name"] a').href || "")
          .split("/")
          .slice(-1)[0] || "";
      const tweet =
        safeQuerySelector('[data-testid="tweetText"]').innerText || "";
      const href = safeQuerySelector("time").closest
        ? safeQuerySelector("time").closest("a").href
        : "";
      const likes = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[data-testid="like"]'),
          "aria-label",
        ),
      );
      const impressions = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[aria-label*="View post analytics"]'),
          "aria-label",
        ),
      );
      const retweets = safeExtractNumber(
        safeGetAttribute(
          safeQuerySelector('[aria-label*="Repost"]'),
          "aria-label",
        ),
      );
      return { datetime, username, tweet, href, likes, impressions, retweets };
    },
  );
  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach((tweetObj) => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}
// Run collectTweets every 500ms
setInterval(collectTweets, 500);
// Run copy(window.tweets) later to copy collected tweets to the clipboard
Instead of all that safeX stuff, roll back to my earlier code - write an extractNumber(select) function which attempts to pull the integers out of the aria label of the thing matched by the provided selector and use that for the likes, impressions and retweets
assistant (Oct 8, 2024, 08:15 PM)
Certainly! Here's a revised version of your code that uses an extractNumber function to handle the likes, impressions, and retweets:
// Ditch any <span>…</span> elements
document
  .querySelectorAll("span")
  .forEach((span) => span.textContent.trim() === "…" && span.remove());
// Initialize window.tweets as an empty array if it doesn't exist
window.tweets = window.tweets || [];
// Set to store previously seen hrefs
let seenHrefs = new Set();
// Function to extract number from aria-label
function extractNumber(el, selector) {
  const element = el.querySelector(selector);
  // getAttribute returns null when aria-label is absent, so fall back to ''
  const label = element ? element.getAttribute('aria-label') || '' : '';
  const match = label.match(/(\d+)/);
  return match ? parseInt(match[0], 10) : 0;
}
// Function to collect tweets
function collectTweets() {
  const newTweets = Array.from(
    document.querySelectorAll("[data-testid=tweet]"),
    (el) => {
      const datetime = el.querySelector("time")?.dateTime || "";
      const username = el.querySelector('[data-testid="User-Name"] a')?.href.split("/").slice(-1)[0] || "";
      const tweet = el.querySelector('[data-testid="tweetText"]')?.innerText || "";
      const href = el.querySelector("time")?.closest("a")?.href || "";
      const likes = extractNumber(el, '[data-testid="like"]');
      const impressions = extractNumber(el, '[aria-label*="View post analytics"]');
      const retweets = extractNumber(el, '[aria-label*="Repost"]');
      return { datetime, username, tweet, href, likes, impressions, retweets };
    }
  );
  // Filter out tweets with previously seen hrefs and add new ones to window.tweets
  newTweets.forEach((tweetObj) => {
    if (!seenHrefs.has(tweetObj.href)) {
      seenHrefs.add(tweetObj.href);
      window.tweets.push(tweetObj);
    }
  });
}
// Run collectTweets every 500ms
setInterval(collectTweets, 500);
// Run copy(window.tweets) later to copy collected tweets to the clipboard
This version makes the following changes:
- It introduces an extractNumber function that takes the parent element and a selector. This function tries to find the element, get its aria-label attribute, extract the number from it, and return the parsed integer or 0 if anything fails.
- The main collectTweets function now uses optional chaining (?.) for accessing properties that might not exist, providing fallback empty strings where appropriate.
- It uses the new extractNumber function for likes, impressions, and retweets, which simplifies the code and makes it more robust against missing elements or attributes.
- The code still retains the ability to handle cases where elements or attributes are missing, defaulting to empty strings for text fields and 0 for numeric fields.
This approach should give you a good balance between simplicity and robustness. It will attempt to extract the data you need, and if any part fails (due to missing elements or changed structure), it will default to safe values without throwing errors.
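Since setInterval keeps firing until the page is closed or the timer is cleared, it can help to keep a way to stop it. Here is a minimal sketch, assuming a hypothetical wrapper named startCollecting. The stand-in collector below only counts its calls so the snippet runs anywhere, and unlike a bare setInterval the wrapper also runs the collector once immediately.

```javascript
// Hypothetical wrapper: runs collectFn once right away, then every intervalMs,
// and returns a function that cancels the timer.
function startCollecting(collectFn, intervalMs = 500) {
  collectFn();
  const timerId = setInterval(collectFn, intervalMs);
  return function stop() {
    clearInterval(timerId);
  };
}

// Usage with a stand-in collector that only counts how often it was called.
let runs = 0;
const stopCollecting = startCollecting(() => { runs += 1; });
stopCollecting(); // cancel right away so this demo does not keep running
console.log(runs); // 1
```

In the console you could pass the real function instead, for example const stop = startCollecting(collectTweets, 500), and call stop() once you have scrolled far enough.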